Assignment 5.5. Nothing here to hand in


Hannah Green
5 years ago

Load the tidyverse before we start:

    library(tidyverse)
    ## Loading tidyverse: ggplot2
    ## Loading tidyverse: tibble
    ## Loading tidyverse: tidyr
    ## Loading tidyverse: readr
    ## Loading tidyverse: purrr
    ## Loading tidyverse: dplyr
    ## Conflicts with tidy packages
    ## filter(): dplyr, stats
    ## lag(): dplyr, stats

1. Can students throw a baseball farther than a softball? A statistics class, containing 24 students, went out to a football field to try to answer this question. Each student warmed up and then threw each type of ball as far as they could. The order of ball types was randomized: some students threw the baseball first, and some threw the softball first. (A softball is bigger than a baseball, so we might expect that a softball would be harder to throw a long way than a baseball.) The data are in http: // in three columns: the first is a number identifying the student, the second is the distance thrown with the baseball (in yards) and the third is the distance thrown with the softball (also in yards).

(a) Read the data into SAS. There are no column headers, which you'll need to take into account.

Solution: The file extension suggests that the data values are separated by spaces, which is correct, but there are no variable names, so getnames=no:

    filename myurl url "
    proc import
      datafile=myurl
      dbms=dlm
      out=throw
      replace;
      delimiter=' ';
      getnames=no;

There are no variable names, so SAS had to invent some:

    proc print;

    Obs  VAR1  VAR2  VAR3

The data values look OK, and there are correctly 24 rows. The column names are VAR1, the student IDs; VAR2, the distance thrown with a baseball; and VAR3, the distance thrown with a softball.

(b) Calculate a column of differences, baseball minus softball.

Solution: Remember how SAS wants you to do this: create a new data set, copy in everything from the previous one, and then create your new variable. Don't forget to use SAS's variable names:

    data throw2;
      set throw;
      diff=var2-var3;

and for completeness check that it worked, bearing in mind that the most recently created data set is the new one, throw2, so this will do the right thing:

    proc print;

    Obs  VAR1  VAR2  VAR3  diff

which it did.

(c) Make a normal quantile plot of the differences. On your plot, add a line (using a µ and σ estimated from the data). What do you conclude from the plot, and thus why would a sign test be more appropriate than a matched-pairs t-test?

Solution: This kind of thing:

    proc univariate noprint;
      qqplot diff / normal(mu=est sigma=est);

In the resulting plot, the differences are mostly normal, except for the outlier at the upper end. The outlier makes us doubt normality, which is assumed for a t-test, so a sign test would be more appropriate.

(d) Think about how you would use a sign test in this matched-pairs situation. Run an appropriate sign test in SAS, bearing in mind the null and alternative hypotheses that you wish to test. What do you conclude, in the context of the data?

Solution: In the matched-pairs context, our null hypothesis is that there is no difference between how far students can throw a baseball and a softball: that is, that the median difference is zero. We wanted to see whether students can throw a baseball farther on average than a softball: that is, whether the median difference is greater than zero (the way around I calculated it: if you did softball minus baseball, the median difference would be less than zero). Thus the SAS code is something like this:

    proc univariate mu0=0;
      var diff;

This will get us, remember, a two-sided test:

    The UNIVARIATE Procedure
    Variable: diff

    Tests for Location: Mu0=0

    Test           -Statistic-    -----p Value------
    Student's t    t              Pr > |t|
    Sign           M      9.5     Pr >= |M|   <.0001
    Signed Rank    S              Pr >= |S|   <.0001

The two-sided P-value is less than 0.0001. But we wanted a one-sided P-value, for testing that the median difference is greater than zero. So we ought first to check that the median difference in the sample is greater than zero, which is also on the proc univariate output:

    Basic Statistical Measures

    Location            Variability
    Mean                Std Deviation
    Median              Variance
    Mode                Range
                        Interquartile Range

    Note: The mode displayed is the smallest of 3 modes with a count of 3.

The median difference is 5, so we are on the correct side, and our one-sided P-value is half the two-sided one, less than 0.00005. This is definitely small enough to reject the null with, and we can conclude that students really can throw a baseball farther than a softball.

For a complete answer, you need in your discussion to say that SAS's P-value is two-sided and we need a one-sided one. Simply halving the two-sided one is not the best (you really ought to convince yourself that you are "on the correct side"), but is acceptable. An answer that simply uses SAS's P-value, even though it leads to the right conclusion, is not the right answer for the right reason, and so is incomplete.

(e) Read the same data into R. You'll need to supply some names to the columns.

Solution: This kind of thing:

    myurl="
    throws=read_delim(myurl," ",col_names=c("student","baseball","softball"))
    ## Parsed with column specification:
    ## cols(
    ##   student = col_integer(),
    ##   baseball = col_integer(),
    ##   softball = col_integer()
    ## )
    throws
    ## # A tibble: 24 x 3
    ##    student baseball softball
    ##      <int>    <int>    <int>
    ## # ... with 14 more rows

This is one of those times where we have to tell R what names to give the columns. Or you can put col_names=F and leave the columns called X1, X2, X3 or whatever they end up as.

(f) Calculate a column of differences, baseball minus softball, in the data frame.

Solution: Add it to the data frame using mutate:

    throws2=throws %>% mutate(diff=baseball-softball)
    throws2
    ## # A tibble: 24 x 4
    ##    student baseball softball  diff
    ##      <int>    <int>    <int> <int>
    ## # ... with 14 more rows

(g) Carry out a sign test in R, testing the null hypothesis that the median difference is zero, against the alternative that it is greater than zero. Obtain a P-value and compare it with the one you got from SAS. Your option whether you use smmr or not.

Solution: I think using smmr is way easier, so I'll do that first. There is even a shortcut in that the null median defaults to zero, which is exactly what we want here:

    library(smmr)
    sign_test(throws2,diff)
    ## $above_below
    ## below above
    ##     2    21
    ##
    ## $p_values
    ##   alternative p_value
    ## 1       lower    e-01
    ## 2       upper    e-05
    ## 3   two-sided    e-05

We want, this time, the upper-tailed one-sided test, since we want to prove that students can throw a baseball a longer distance than a softball. Thus the P-value we want is the one labelled upper.

To build it yourself, you know the steps by now. First step is to count how many differences are greater and less than zero:

    table(throws2$diff>0)
    ##
    ## FALSE  TRUE
    ##     3    21

or

    table(throws2$diff<0)
    ##
    ## FALSE  TRUE
    ##    22     2

or, since we have things in a data frame,

    throws2 %>% count(diff>0)
    ## # A tibble: 2 x 2
    ##   `diff > 0`     n
    ##        <lgl> <int>
    ## 1      FALSE     3
    ## 2       TRUE    21

or count those less than zero. I'd take any of those.

Note that these are not all the same. One of the differences is in fact exactly zero. The technically right thing to do with the zero difference is to throw it away (leaving 23 differences with 2 negative and 21 positive). I would take that, or 2 or 3 negative differences out of 24 (depending on whether you count "greater than zero" or "less than zero"). We hope that this won't make a material difference to the P-value; it'll make some difference, but won't (we hope) change the conclusion about whether to reject.
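The counting step, with any exactly-zero differences thrown away first, can be sketched in base R as below. This is only an illustration under stated assumptions: the vector d holds made-up differences (not the class data), and the names null_median, d_used, n_above, n_below and n_used are hypothetical.

```r
# Hypothetical differences (baseball minus softball); NOT the class data
d <- c(5, -2, 0, 7, 3, -1, 4, 9, 2, 6)

# Throw away any value exactly equal to the null median before counting
null_median <- 0
d_used <- d[d != null_median]

# Count the differences on each side of the null median
n_above <- sum(d_used > null_median)
n_below <- sum(d_used < null_median)
n_used  <- length(d_used)

c(below = n_below, above = n_above, n = n_used)
```

Counting this way keeps the two tallies consistent with each other: they always add up to the number of differences actually used, which is the number of trials for the binomial calculation.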

Second step is to get a P-value for whichever one of those you got, from the appropriate binomial distribution. The P-value is the probability of getting 21 (or 22) or more positive differences out of 24 (or 23), since this is the end of the distribution we should be at if the alternative hypothesis is correct. Equivalently, it is the probability of getting 2 (or 3) or fewer negative differences. Thus any of these will get you a defensible P-value:

    sum(dbinom(21:23,23,0.5))
    ## [1] e-05
    sum(dbinom(22:24,24,0.5))
    ## [1] e-05
    sum(dbinom(21:24,24,0.5))
    ## [1]
    sum(dbinom(0:2,23,0.5))
    ## [1] e-05
    sum(dbinom(0:2,24,0.5))
    ## [1] e-05
    sum(dbinom(0:3,24,0.5))
    ## [1]

The first and fourth of those are the same as smmr (throwing away the exactly-median value). SAS's P-value was less than 0.00005 (remember, half of the one on the output). SAS actually does something else if there are values exactly equal to the median: it counts them as half above and half below. [1] If you got the last of those P-values, you ought to remark that it's slightly greater than the one SAS produced. As we hoped, there is no material difference here: there is no doubt with any of these possibilities that we will reject a median difference of zero in favour of a median difference greater than zero.

2. Previously, you carried out a sign test to determine whether students could throw a baseball farther than a softball. This time, we will calculate a confidence interval for the median difference baseball minus softball, using the results of sign tests.

(a) Read the data into R, giving appropriate names to the columns, and add a column of differences.

Solution: Of course, you can copy this from my solutions, which is fine since they are already public. Any way that works is OK, including tidyverse ideas, but I did it this way, combining the reading of the data with the calculation of the differences in one pipe:

    myurl="
    throws = read_delim(myurl," ",col_names=c("student","baseball","softball")) %>%
      mutate(diff=baseball-softball)
    ## Parsed with column specification:
    ## cols(
    ##   student = col_integer(),
    ##   baseball = col_integer(),
    ##   softball = col_integer()
    ## )
    throws
    ## # A tibble: 24 x 4
    ##    student baseball softball  diff
    ##      <int>    <int>    <int> <int>
    ## # ... with 14 more rows

(b) What function in smmr will run a two-sided sign test and return only the P-value? Check that it works by testing whether the median difference for your data is zero or different from zero.

Solution: It's called pval_sign. If you haven't run into it before, in R Studio click on Packages, find smmr, and click on its name. This will bring up package help, which includes a list of all the functions in the package, along with a brief description of what each one does. (Clicking on a function name brings up the help for that function.)

Let's check that it works properly by repeating the previous sign test and verifying that pval_sign gives the same thing:

    sign_test(throws,diff,0)
    ## $above_below
    ## below above
    ##     2    21
    ##
    ## $p_values
    ##   alternative p_value
    ## 1       lower    e-01
    ## 2       upper    e-05
    ## 3   two-sided    e-05
    pval_sign(0,throws,diff)
    ## [1] e-05

The P-values are the same (for the two-sided test) and both small, so the median difference is not zero.

(c) Based on your P-value, do you think 0 is inside the confidence interval or not? Explain briefly.

Solution: Absolutely not. The two-sided P-value for a median of zero is far below 0.05, so a median of zero is rejected at the 5% level, and therefore zero cannot be in the 95% confidence interval. Our suspicion, from the one-sided test from earlier, is that the differences were mostly positive (people could throw a baseball farther than a softball, in most cases). So the confidence interval ought to contain only positive values. I ask this because it drives what happens below.

(d) Obtain a 95% confidence interval for the population median difference, baseball minus softball, using a trial-and-error procedure that determines whether a number of possible medians are inside or outside the CI.

Solution: I've given you a fair bit of freedom to tackle this as you wish. Anything that makes sense is good: whatever mixture of mindlessness, guesswork and cleverness that you want to employ. The most mindless way is to try some values one at a time and see what you get, e.g.:

    pval_sign(1,throws,diff)
    ## [1]
    pval_sign(5,throws,diff)
    ## [1]

So median 1 is outside and median 5 is inside the 95% interval. Keep trying values until you've figured out where the lower and upper ends of the interval are: where the P-values cross from below 0.05 to above, or vice versa.

Something more intelligent is to make a long list of potential medians, and get the P-value for each of them, e.g.:

    my.med=seq(0,20,2)
    pvals=map_dbl(my.med,pval_sign,throws,diff)
    data.frame(my.med,pvals)
    ## my.med pvals
    ## e-05
    ## e-02
    ## e-01
    ## e-01
    ## e-01
    ## e-02
    ## e-03
    ## e-05
    ## e-05
    ## e-06
    ## e-06

2 is just inside the interval, 8 is also inside, and 10 is outside. Some closer investigation:

    my.med=seq(0,2,0.5)
    pvals=map_dbl(my.med,pval_sign,throws,diff)
    data.frame(my.med,pvals)
    ## my.med pvals
    ## e-05
    ## e-04
    ## e-03
    ## e-02
    ## e-02

The bottom end of the interval actually is 2, since 2 is inside and 1.5 is outside.

    my.med=seq(8,10,0.5)
    pvals=map_dbl(my.med,pval_sign,throws,diff)
    data.frame(my.med,pvals)
    ## my.med pvals

The top end is 9, 9 being inside and 9.5 outside. Since the data values are all whole numbers, I think this is accurate enough.

The most sophisticated way is the bisection idea we saw before. We already have a kickoff for this, since we found, mindlessly, that 1 is outside the interval on the low end and 5 is inside, so the lower limit has to be between 1 and 5. Let's try halfway between, i.e. 3:

    pval_sign(3,throws,diff)
    ## [1]

Inside, so the lower limit is between 1 and 3. This can be automated, thus:

    lo=1
    hi=3
    while(abs(hi-lo)>0.1) {
      try=(lo+hi)/2
      ptry=pval_sign(try,throws,diff)
      if (ptry>0.05) {
        hi=try
      } else {
        lo=try
      }
    }
    c(lo,hi)
    ## [1]

The difficult bit is to decide whether the value try becomes the new lo or the new hi. If the P-value for the median of try is greater than 0.05, try is inside the interval, and it becomes the new hi; otherwise it's outside and becomes the new lo. Whatever the values are, lo is always outside the interval and hi is always inside, and they move closer and closer to each other.
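One way to avoid deciding "new lo or new hi" separately for each end is to track the endpoints as "known inside" and "known outside" instead. The sketch below is a hedged illustration, not the smmr implementation: sign_pval is a hypothetical base-R stand-in for pval_sign (assumed to drop values equal to the trial median and to double the smaller tail), bisect_limit is a hypothetical helper, and x is made-up data rather than the class data.

```r
# Two-sided sign-test P-value for H0: median = med.
# Hypothetical stand-in for smmr's pval_sign; values equal to med are dropped.
sign_pval <- function(med, x) {
  x <- x[x != med]
  n <- length(x)
  k <- min(sum(x > med), sum(x < med))
  min(1, 2 * pbinom(k, n, 0.5))
}

# Bisection between a value known to be inside the interval and one known
# to be outside it; the same code serves either end of the interval.
bisect_limit <- function(inside, outside, x, alpha = 0.05, tol = 0.1) {
  while (abs(inside - outside) > tol) {
    mid <- (inside + outside) / 2
    if (sign_pval(mid, x) > alpha) {
      inside <- mid    # mid is not rejected: it is inside the interval
    } else {
      outside <- mid   # mid is rejected: it is outside the interval
    }
  }
  c(inside = inside, outside = outside)
}

# Made-up data, NOT the class data; its sample median 7.5 is certainly inside
x <- c(1, 3, 4, 6, 7, 8, 10, 12, 15, 21)
lower <- bisect_limit(inside = 7.5, outside = 0, x = x)
upper <- bisect_limit(inside = 7.5, outside = 25, x = x)
lower
upper
```

For these made-up values the two searches close in on 3 at the low end and 15 at the high end. With the class data one would pass the column of differences as x and start from values already known to be inside and outside.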

At the other end of the interval, lo is inside and hi is outside, so there is a little switching around within the loop. For starting values, you can be fairly mindless: for example, we know that 5 is inside and something big like 20 must be outside:

    lo=5
    hi=20
    while(abs(hi-lo)>0.1) {
      try=(lo+hi)/2
      ptry=pval_sign(try,throws,diff)
      if (ptry>0.05) {
        lo=try
      } else {
        hi=try
      }
    }
    c(lo,hi)
    ## [1]

The interval goes from 2 to (as calculated here) just under 9.

Of course, smmr is much easier:

    ci_median(throws,diff)
    ## [1]

This uses the bisection method with a smaller tolerance than we did, so the answer is more accurate. It looks as if the interval goes from 2 to 9: that is, students can throw a baseball on average between 2 and 9 yards farther than they can throw a softball.

3. Previously, we looked at a parking survey designed to address whether men or women were better at parallel parking. Let's revisit these data, and see what might be a better test than the two-sample t-test we did before. The data were in an Excel spreadsheet. Read the data into R, the same way that you did it before (assuming that it worked for you then).

Solution: This is an Excel spreadsheet, so you need to do something like this:

    library(readxl)
    parking=read_excel("parking.xlsx",sheet=2)
    parking
    ## # A tibble: 93 x 2
    ##    distance gender
    ##       <dbl>  <chr>
    ##               male
    ##               male
    ##               male
    ##               male
    ##               male
    ##               male
    ##               male
    ##               male
    ##               male
    ##               male
    ## # ... with 83 more rows

(a) Make, or re-make, a plot that will help you assess the assumptions of the two-sample t-test. Why do you have doubts about the two-sample t-test?

Solution: The no-thinking plot is to note that you have one quantitative variable distance and one categorical one gender, and so a side-by-side boxplot is the way to go:

    ggplot(parking,aes(x=gender,y=distance))+geom_boxplot()

[Figure: side-by-side boxplots of distance (y-axis) by gender (x-axis: female, male)]

The assumption behind the two-sample t-test is that both groups have approximately normal distributions. I think that fails here, because both distributions have outliers, or are skewed to the right (depending on the way you look at it).

Thinking further about normality, you might consider that normal quantile plots, one for each group, would be the thing. This comes out nicely in ggplot with facets, once you get your head around what you need to do:

    ggplot(parking,aes(sample=distance))+stat_qq()+
      facet_wrap(~gender,ncol=1)

[Figure: normal quantile plots of distance, one panel per gender (female above, male below); x-axis theoretical, y-axis sample]

First we make a plot that produces the right kind of thing (stat_qq requires a sample of data) but for all the data together, and then, at the end, we produce a separate plot for each gender. I added one extra thing here: the default layout for two plots is to put them left and right, which makes them look tall and skinny (and hard to interpret), so I'd rather put them above and below. One way to arrange this is with facet_grid; this way is another, arranging all the subplots in an array with one column.

So what are those plots showing us? For the males (at the bottom), I think the principal feature is the outlier at the top end; the other points are more or less straight. For the females (at the top), there is more of a curve, with values bunched up at the bottom and spread out at the top: skewed to the right. Or, you might say, the lowest values, the ones below -1 on the x-axis, are too bunched up, but the other values are more or less straight. Your call. Actually, bunching up at the bottom is not really indicating a problematic departure from normality (the Central Limit Theorem works just fine with short tails); it's long tails or outliers that really cause problems.
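Judging "more or less straight" is easier with a reference line on each panel. As a minimal base-R sketch (a swapped-in alternative to the ggplot code, not the method used above), the following draws one normal quantile plot per group, stacked in a single column, each with a line. The data frame sim is simulated right-skewed data, not the parking data, and its column names are chosen only to mirror the ones in the text.

```r
# Simulated right-skewed distances for two groups; NOT the parking data
set.seed(1)
sim <- data.frame(
  distance = c(rexp(40, rate = 1 / 10), rexp(40, rate = 1 / 12)),
  gender   = rep(c("female", "male"), each = 40)
)

# One normal quantile plot per group, arranged above and below
op <- par(mfrow = c(2, 1))
for (g in unique(sim$gender)) {
  d <- sim$distance[sim$gender == g]
  qqnorm(d, main = g)  # sample quantiles against theoretical normal quantiles
  qqline(d)            # line through the first and third quartiles
}
par(op)                # restore the previous plot layout
```

Note that qqline draws its line through the quartiles, which is not the same line as the one SAS draws from the estimated mean and standard deviation, so the two lines can differ when there are outliers.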

One of the classic situations that causes skewness is when there is a lower limit (that the data come close to). In this case, the variable is distance (from the curb), which cannot be less than zero. Most of the drivers parked their car pretty close to the curb (so there were a lot of values close to zero), but a few drivers were a long way away (a few very big positive values). This is exactly the kind of situation where you get skewness, and so it is no surprise that we saw what we did. Most of the cases of skewness that you see in practice can be traced back to data being close to a limit at one end. The classic case of left-skewness is an easy exam: most students get close to 100% (upper limit), while a few get a fair bit less.

(b) Test for a difference between the median parking distances between males and females, using Mood's median test. Build this yourself in R, as in the lecture. What do you conclude?

Solution: First, work out the overall median of all the distances, regardless of gender:

    parking %>% summarize(med=median(distance))
    ## # A tibble: 1 x 1
    ##     med
    ##   <dbl>
    ## 1     9

The overall median is 9. Count up how many distances of each gender were above or below the overall median.

    tab=with(parking,table(gender,distance<9))
    tab
    ##
    ## gender   FALSE TRUE
    ##   female
    ##   male

For example, 19 of the male drivers had a distance (strictly) less than 9. Both genders are pretty close to an even split above and below the overall median, which suggests that the males and females have about the same median.

Strictly, I'm supposed to throw away any values that are exactly equal to the overall median. Are there any here?

    any(parking$distance==9)
    ## [1] TRUE

There are. I'll come back to that later, but for now, we'll go with the table we have. Is there an association between gender and being above or below the overall median? That's a chi-squared test for independence:

    chisq.test(tab,correct=F)
    ##
    ## Pearson's Chi-squared test
    ##
    ## data: tab
    ## X-squared = , df = 1, p-value =

This is even less significant than the two-sample t-test we did before, and so is consistent with our conclusion from before that there is actually no difference between males and females in terms of average parking distance. Mood's median test is believable because it is not affected by outliers or distribution shape.

My package smmr does Mood's median test as well as the sign test. (I made up the name as "sign and Mood median test in R".) The function median_test takes three things: a data frame, a column of values and a column of group memberships (both unquoted), which is exactly what we have:

    library(smmr)
    median_test(parking,distance,gender)
    ## $table
    ##         above
    ## group    above below
    ##   female
    ##   male
    ##
    ## $test
    ##        what value
    ## 1 statistic
    ## 2        df
    ## 3   P-value

This has a 0-variant that you can use if you can't get this to work. For this, you need to specify two columns: the measurements and the groups. You can specify the data frame twice, or you can use with, like this:

    with(parking,median_test0(distance,gender))
    ## $table
    ##         below
    ## group    FALSE TRUE
    ##   female
    ##   male
    ##
    ## $test
    ##        what value
    ## 1 statistic
    ## 2        df
    ## 3   P-value

which gives identical results.

The numbers in the tables here are a bit different from what we had before. This is because I wrote the function first to get rid of any data values exactly equal to the median (and we previously determined that there are some). We can get a more detailed look this way:

    parking %>% filter(distance==9)
    ## # A tibble: 6 x 2
    ##   distance gender
    ##      <dbl>  <chr>
    ## 1        9   male
    ## 2        9   male
    ## 3        9 female
    ## 4        9 female
    ## 5        9 female
    ## 6        9 female

This shows all of the people whose parking distance was exactly 9 (the overall median). There are six of them, two males and four females. In my first attempt, these got counted as FALSE ("not strictly less than 9"), but now they are thrown away: not counted at all. Check that the female-FALSE frequency has decreased by 4, and the male-FALSE frequency has decreased by 2, with the other two being the same. So the P-value from median_test is a bit smaller than before, because it now looks as if slightly more females were below the median distance and slightly more males were above it.

Anyway, the P-value is still nowhere near significance, so we have no evidence of a difference in median parking distance between males and females. The deviation we see from a completely even split is exactly the kind of thing that could have happened by chance.

(c) Now we'll repeat the same test in SAS (which has it built in). First read the data into SAS and summarize the values.

Solution: This is completely copied from what I did before: [2]

    proc import
      datafile='/home/ken/parking.xlsx'
      dbms=xlsx
      out=mydata
      replace;
      sheet=sheet2;
      getnames=yes;

    proc means;
      var distance;
      class gender;

    The MEANS Procedure

    Analysis Variable : distance distance

    gender   N Obs   N   Mean   Std Dev   Minimum   Maximum
    female
    male

The same number of males and females that we had before, and a slightly smaller mean for the females. Or, find the median and quartiles and compare with the boxplots:

    proc means q1 median q3;
      var distance;
      class gender;

    The MEANS Procedure

    Analysis Variable : distance distance

    gender   N Obs   Lower Quartile   Median   Upper Quartile
    female
    male

Bearing in mind that the SAS and R definitions of quartiles do differ, so you may not get exactly the same thing, these appear to be the same as the boxplots.

(d) Run Mood's median test. What do you conclude here, and do you get the same result as R (either the way you did it or the way smmr does it)?

Solution: This is proc npar1way with option median (not mood!):

    proc npar1way median;
      var distance;
      class gender;

    The NPAR1WAY Procedure

    Median Scores (Number of Points Above Median) for Variable distance
    Classified by Variable gender

    gender   N   Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
    male
    female

    Average scores were used for ties.

    Median Two-Sample Test
    Statistic
    Z
    One-Sided Pr > Z
    Two-Sided Pr > |Z|

    Median One-Way Analysis
    Chi-Square
    DF                 1
    Pr > Chi-Square

This gives the same conclusion as before (no difference between the medians for males and females), but a different P-value (look in the Median One-Way Analysis at the end of the output). I think the difference is yet another way of handling those observations that are exactly equal to 9.

If you go back up to the table of median scores at the top of the output, the Sum of Scores column is the key. If there are no observations exactly equal to the overall median, this will be the numbers in our FALSE columns above: the number of values above the overall median. If there are values equal to the overall median, something else happens.

In this case, there are 93 data values altogether. 43 of them are strictly less than the median, 44 are strictly greater and the other 6 are exactly equal to the median. If those values exactly equal to the median were in fact different from each other, they would have ranks 44, 45, ..., 49 from the bottom. The median would have rank (93 + 1)/2 = 47, so the first four of these are less than or equal to the median, and the last two are strictly greater.

Now, we have two groups, so if those observations had actually been different from each other, we don't know which ones of them would have been greater than the median and which would not. So we pretend that 2/6 = 1/3 of them were greater than the median in each group. There were two male observations equal to 9, so SAS pretends that 2(1/3) = 2/3 = 0.67 of them were greater than 9; with the 44 - 19 = 25 male observations strictly greater, that gives a total of 25 + 0.67 = 25.67. There were four female observations equal to 9, and 19 strictly greater, giving a total of 19 + 4(1/3) = 20.33. Those match the sums of scores in the output.

Notes

[1] There are extra complications with more than one exactly-equal, which we'll see with Mood's median test later.

[2] This is the value of keeping all your work and being able to find it later.
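The tie-handling arithmetic described in the solution to 3(d) can be checked with a few lines of R. This is a sketch that uses only the counts quoted in the text; the variable names are hypothetical, and the number of male distances strictly above the median is inferred as 44 - 19 rather than read from the output.

```r
# Counts quoted in the text
n_total <- 93
n_below <- 43                              # strictly less than the overall median
n_above <- 44                              # strictly greater than the overall median
n_tied  <- n_total - n_below - n_above     # exactly equal to the overall median

# Rank of the median, and the share of tied ranks that fall above it
median_rank <- (n_total + 1) / 2
tied_ranks  <- (n_below + 1):(n_below + n_tied)
frac_above  <- mean(tied_ranks > median_rank)

# Per-group counts: ties and strictly-greater values
# (male strictly-greater count is inferred as 44 - 19, an assumption)
tied  <- c(female = 4, male = 2)
above <- c(female = 19, male = n_above - 19)

# Each tied value contributes frac_above to its group's score
scores <- above + tied * frac_above
round(scores, 2)
```

The two scores add up to the number of values strictly above the median plus the two tied ranks that fall above it, which is a useful sanity check on the bookkeeping.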
