Hypothesis Test Exercises from Class, Oct. 12, 2018



Question 1: Is there a difference in mean sepal length between versicolor irises and setosa ones?

Worked on by Victoria BienAime and Pearl Park

Null Hypothesis: $\mu_v - \mu_s = 0$
Alternative Hypothesis: $\mu_v - \mu_s > 0$

We look only at data that excludes the species virginica:

    library(mosaic)  # provides do(), shuffle(), the formula interface to mean(), and gf_histogram()
    x <- droplevels(subset(iris, Species != "virginica"))
    head(x)
    ##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ## 1          5.1         3.5          1.4         0.2  setosa
    ## 2          4.9         3.0          1.4         0.2  setosa
    ## 3          4.7         3.2          1.3         0.2  setosa
    ## 4          4.6         3.1          1.5         0.2  setosa
    ## 5          5.0         3.6          1.4         0.2  setosa
    ## 6          5.4         3.9          1.7         0.4  setosa

Randomization distribution for the difference in mean sepal length between versicolor (V) and setosa (S):

    y <- do(1) * (diff(mean(Sepal.Length ~ shuffle(Species), data = x)))
    gf_histogram(~versicolor, data = y, color = "black", fill = "red")

[Figure: histogram of the randomization distribution; x-axis: versicolor]

Observed mean difference:

    diff(mean(Sepal.Length ~ Species, data = x))
    ## versicolor
    ##       0.93

To find the p-value for the observed mean difference, we count the rows (cases) of the randomization distribution that are greater than or equal to 0.93:

    nrow(subset(y, versicolor >= .93))

    ## [1]

This count is divided by the total number of values (n=1):

    nrow(subset(y, versicolor >= .93)) / 1
    ## [1]

So the null hypothesis is rejected: the observed difference lies beyond the whole randomization distribution (it is "off the chart").

Question 3: Is there a positive correlation between eruption length and wait time for the Old Faithful geyser?

Worked on by Danish and Michael

Null Hypothesis: There is no correlation between eruption length and wait time. $H_0: \rho = 0$
Alternative Hypothesis: There is a positive correlation between eruption length and wait time. $H_a: \rho > 0$

    cor(eruptions ~ waiting, data = faithful)
    ## [1] 0.9008112
    x <- do(5) * cor(eruptions ~ shuffle(waiting), data = faithful)
    head(x)
    ##   cor
    ##
    ##
    ##
    ##
    ##
    ##
    gf_histogram(~cor, data = x, color = "black")

[Figure: histogram of the randomization correlations; x-axis: cor]

    # divide the count (not the data frame) by the number of randomization samples
    nrow(subset(x, cor >= .9)) / 5
    ## [1]
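The shuffle-and-count procedure used for Question 3 can also be sketched in base R, without the mosaic helpers. This is only an illustrative sketch: the seed and the number of shuffles (`n_rand`) are assumptions chosen for the example, not values taken from the class work.

```r
# Base-R sketch of a one-sided randomization test for a correlation.
# Assumptions (not from the handout): the seed and n_rand below.
set.seed(1)
n_rand <- 2000

# Observed sample correlation in the built-in faithful data set.
obs_cor <- cor(faithful$eruptions, faithful$waiting)

# Shuffling one variable breaks the pairing, which mimics H0: rho = 0.
rand_cors <- replicate(
  n_rand,
  cor(faithful$eruptions, sample(faithful$waiting))
)

# One-sided P-value: share of shuffled correlations at least as large
# as the observed one.
p_value <- mean(rand_cors >= obs_cor)

print(round(obs_cor, 4))
print(p_value)
```

Taking `mean()` of a logical vector gives the proportion of `TRUE` values directly, so the count and the division happen in one step.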

Given that the P-value is smaller than 0.05, obtaining a correlation of 0.9008 is not frequent in a world where the null hypothesis is true. So we reject the null hypothesis.

Question 4: Is there a nonzero correlation even between unrelated data?

Worked on by Thomas Scofield

In introducing this question, I suggested these commands for generating our lists of numbers.

    x <- runif(n=50, min=15, max=45)
    y <- rnorm(n=50, mean=30, sd=5)

In choosing the 50 numbers now found in x, it is as if all numbers between 15 and 45 were equally likely, and 50 were chosen at random; whereas the numbers for y were not equally likely, but 30, and numbers nearby, were most likely to be chosen, with the likelihood falling off as a number becomes farther from 30 (falling off like a normal distribution). Here is a scatter plot of the resulting chosen xy-pairs.

    gf_point(y~x)

[Figure: scatter plot of y against x]

The x- and y-coordinates of these plotted points were chosen with no relationship between them. But they will still yield a nonzero sample correlation.

    cor(y ~ x)
    ## [1]

To see if this test statistic is statistically significant, we obtain a randomization distribution and see how often a result as extreme as this one occurs. Our hypotheses are $H_0: \rho = 0$, $H_a: \rho \neq 0$. Under this null hypothesis, one is just as likely to see any of the y-values paired with any of the x-values.

    manycors <- do(10000) * cor(y ~ shuffle(x))
    head(manycors)
    ##   cor
    ##
    ##
    ##

    ##
    ##
    ##

We plot these randomization statistics, shading all that are at least as extreme (on either side of the null value 0) as ours.

    gf_histogram(~cor, data=manycors, color="black", fill=~abs(cor) >= .1855)

[Figure: histogram of cor, shaded by whether abs(cor) >= .1855 (FALSE / TRUE)]

Counting occurrences of randomization statistics this extreme, we find the approximate P-value.

    nrow( subset(manycors, cor >= .1855) ) / 10000
    ## [1] 0.0956

What we have witnessed here, a correlation of 0.1856, only occurs about 10% of the time when 50 points are chosen with the x- and y-coordinates chosen independently. If we set $\alpha = 0.10$ and drew a conclusion, we would reject the null hypothesis and, in this case, would have committed a Type I error, something that happens in 10% of cases with a true null hypothesis and $\alpha = 0.10$.

Question 5: Do births occur on weekend days in their proper proportion to a full week?

Worked on by Kaitlyn Westra, Maddie Lenning and Allyson Prichard

$H_0: p = 2/7$
$H_a: p \neq 2/7$

First, we'll find out how many births happened throughout 2015.

    sum(~births, data=Births2015)
    ## [1]
    totalbirths <- sum(~births, data=Births2015)
    sunbirths <- sum(~births, data=subset(Births2015, wday=="Sun"))
    satbirths <- sum(~births, data=subset(Births2015, wday=="Sat"))
    (sunbirths+satbirths)/totalbirths
    ## [1]

    teststat <- (sunbirths+satbirths)/totalbirths

That's the proportion of weekend births out of the total births. (I feel like that was done in a roundabout way though. I think there's a better and/or more exact way...) This number is our test statistic.

Here's our randomization distribution:

We tried several more familiar ways to produce randomization distributions for a single proportion. Using rflip() to flip a weighted coin 3 million times in order to produce a single randomization sample proved excessively slow. The idea of sampling with replacement from a bag, implemented below, was about 5 times faster, but still took a long time when repeated just 1 times.

    bag <- c(0,0,0,0,0,1,1)   # five weekday slots (0) and two weekend slots (1)
    manyprobs <- do(1)*(sum(sample(bag, size=totalbirths, replace=TRUE))/totalbirths)

In contrast, a command we haven't previously seen, but one tailored for this very purpose, was lightning fast, even in producing 5 randomization statistics.

    rbinom(5, size=totalbirths, prob=2/7) / totalbirths

A better version, one that creates a data frame called manyprobs with a column called result, is this one:

    manyprobs <- data.frame(result = rbinom(5, size=totalbirths, prob=2/7) / totalbirths)
    gf_histogram(~result, data=manyprobs, color="black")

[Figure: histogram of the randomization proportions; x-axis: result]

    2*nrow(subset(manyprobs, result <= teststat)) / 5
    ## [1]

It looks like our approximate P-value under a 2-sided alternative hypothesis is that number. P-value = .4, so we can reject our $H_0$.

Question 6: Are women left-handed at a different rate than men?

Worked on by Abena Oduro

Loading the required data

The prompt I chose was prompt #6, which reads "Are women left-handed at a different rate than men?". To load the data from a comma-separated values file, I used the read.csv() command and saved the result as hands. However, the Selfhandedness column in this dataset had some empty values, so I used the droplevels(subset()) command to remove those and focus specifically on the cases that had the values R for right-handedness and L for left-handedness. I saved these under handy.

    hands <- read.csv("
    handy <- droplevels(subset(hands, selfhandedness=="L" | selfhandedness=="R"))

Calculating the test statistic

For this prompt, I took the null hypothesis to be $H_0: p_D = 0$, where $p_D = p_m - p_f$. This says that the proportion of males who are left-handed ($p_m$) and the proportion of females who are left-handed ($p_f$) are equal in the population. I took the alternative hypothesis to be $H_a: p_D \neq 0$, which says that the proportions of males and females who are left-handed in the population are not equal. In all cases, $p_m - p_f$ is abbreviated as $p_D$.

To calculate the test statistic $\hat{p}_D$, I first used the tally() command to find the numbers of males and females that were right- or left-handed. Then, using the prop() command with success="L", I narrowed down the results to the proportions of males and females who were left-handed. This made the data specific to the question "Are women left-handed at a different rate than men?" Finally, to calculate $\hat{p}_D$, I used the diff() command to find the difference between the two proportions.

    tally(selfhandedness~gender, data=handy)
    ##               gender
    ## selfhandedness F M
    ##              L
    ##              R
    prop(selfhandedness~gender, data=handy, success="L")
    ## prop_L.F prop_L.M
    ##
    diff(prop(selfhandedness~gender, data=handy, success="L"))
    ## prop_L.M
    ##

Creating a Randomization Distribution

To create a randomization distribution, I used the same diff(prop()) command with success="L", but this time I used shuffle(gender) to tell R to assign the values of gender at random to the values of selfhandedness. This was done to create a set of values for which the null hypothesis is true. Using the do(5) command, I created 5 samples under the null condition and saved them under lefty.

To view the distribution of these sample differences, I used the command gf_histogram to generate a histogram of the values saved in lefty. As expected, it was centered around the null value, 0.

    lefty <- do(5)*diff(prop(selfhandedness~shuffle(gender), data=handy, success="L"))
    gf_histogram(~prop_L.M, data=lefty, color="black", fill="white")

[Figure: histogram of prop_L.M for the randomization samples in lefty]
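The label-shuffling idea for a difference in two proportions can be sketched in base R as follows. Everything specific here is an assumption made for illustration: the counts are hypothetical toy data (not the class survey), and `prop_diff`, the seed, and the 5000 shuffles are names and values invented for the example.

```r
# Base-R sketch of a randomization test for a difference in proportions.
# Hypothetical toy data, NOT the class survey: 110 females, 90 males.
set.seed(2)
gender <- rep(c("F", "M"), times = c(110, 90))
hand <- c(rep(c("L", "R"), times = c(10, 100)),  # females: 10 left-handed
          rep(c("L", "R"), times = c(12, 78)))   # males: 12 left-handed

# Difference in sample proportions of left-handers, males minus females.
prop_diff <- function(hand, gender) {
  mean(hand[gender == "M"] == "L") - mean(hand[gender == "F"] == "L")
}

obs_diff <- prop_diff(hand, gender)

# Shuffling the gender labels makes handedness unrelated to gender,
# which is the situation described by H0: p_D = 0.
rand_diffs <- replicate(5000, prop_diff(hand, sample(gender)))

# Two-sided P-value: shuffled differences at least as far from 0
# (in either direction) as the observed difference.
p_value <- mean(abs(rand_diffs) >= abs(obs_diff))

print(obs_diff)
print(mean(rand_diffs))  # should sit close to the null value 0
print(p_value)
```

Comparing absolute values counts both tails in one expression, which is equivalent to adding a right-tail count and a left-tail count when the two cutoffs are the observed difference and its negative.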

Calculating the P-Value

To calculate the P-value of the original test statistic $\hat{p}_D$, I used the nrow(subset()) command to find the number of values in the distribution that were at or above it. Since $H_a$ implies a two-tailed test, I added the number of values in the distribution that were at or below the corresponding value in the opposite tail. Then I divided the whole count by 5, which is the number of samples in the distribution. This gives the proportion of the 5 samples that were as extreme as or more extreme than the test statistic, to the right or to the left, which is the p-value. The p-value was calculated to be 1.

    obs <- diff(prop(selfhandedness~gender, data=handy, success="L"))  # observed difference
    (nrow(subset(lefty, prop_L.M >= abs(obs))) + nrow(subset(lefty, prop_L.M <= -abs(obs)))) / 5
    ## [1] 1

Question 7: Is a male student at Calvin College typical when it comes to height?

Worked on by Daniel Sculley and Matthew Vos

$H_0: \mu_h = 70$
$H_a: \mu_h \neq 70$
Significance threshold: $\alpha = 0.05$

First: We loaded the data set we will be analyzing and pulled out the male students.

    y <- read.csv("
    CalvinStats <- subset(y, gender=="m")

Next: We calculated the test statistic, the sample mean.

    mean(~height, data=CalvinStats, na.rm=TRUE)
    ## [1]

Test statistic $\bar{x}_h$: We will use this to find the amount we need to add to every data point to generate our randomization distribution:

    ## [1]

Thus we add this amount to every height in the data set.

Next: We generate a randomization distribution

    xbar <- mean(~height, data=CalvinStats, na.rm=TRUE)   # observed sample mean
    shift <- 70 - xbar                                    # amount added to every height
    x <- do(5)*mean(~resample(height + shift), data=CalvinStats, na.rm=TRUE)
    gf_histogram(~mean, data=x, color="black")

[Figure: histogram of the randomization means; x-axis: mean]

Finally: We calculate our p-value for the data set

    (nrow(subset(x, mean >= xbar))/5)*2
    ## [1]

As there are no values as extreme as our test statistic in the randomization distribution, it is safe to say that the result is significant at the 0.05 level.

Editor: Thomas Scofield
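The shift-then-resample procedure of Question 7 can be sketched in base R as below. The heights are hypothetical values made up for the example (not the Calvin data), and the seed and the 5000 resamples are likewise assumptions. Unlike the handout, which doubles one tail, this sketch measures distance from the null mean in both directions.

```r
# Base-R sketch of a randomization test for a single mean.
# Hypothetical heights in inches, NOT the Calvin data set.
set.seed(3)
heights <- c(68, 71, 73, 69, 72, 74, 70, 75, 67, 72, 71, 73)
null_mean <- 70

obs_mean <- mean(heights)

# Shift every value so the sample is centered on the null mean while
# keeping its spread; this builds a "world" in which H0 is true.
shift <- null_mean - obs_mean
shifted <- heights + shift

# Resample with replacement from the shifted data and record each mean.
rand_means <- replicate(5000, mean(sample(shifted, replace = TRUE)))

# Two-sided P-value: resampled means at least as far from the null mean
# as the observed mean is.
p_value <- mean(abs(rand_means - null_mean) >= abs(obs_mean - null_mean))

print(obs_mean)
print(shift)
print(p_value)
```

Because the shift only moves the center, the resampled means vary about as much as means from the original sample would, which is what makes the comparison with the observed mean fair.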

Introduction to Hypothesis Testing T.Scofield 10/03/2016

Hypothesis Testing: the steps
1. Identify the research question, along with relevant variables.
2. Formulate hypotheses (null and alternative) appropriate