Goodness-of-Fit Testing T.Scofield Nov. 16, 2016

We do goodness-of-fit testing with a single categorical variable, to see if the distribution of its sampled values fits a specified probability model. The probability model is stated in the null hypothesis. As the presence of a null hypothesis implies, goodness-of-fit tests are hypothesis tests. For each of the possible values of the categorical variable, the null hypothesis should assert a population proportion.

Example: A die can produce any of 6 rolls. In the fair die model, we expect these rolls to be equally likely, with each occurring (over the long haul) one-sixth of the time. To see if this model is a good fit for a sample of rolls taken from a particular die, we would presume this null hypothesis:

$$H_0: p_1 = p_2 = p_3 = p_4 = p_5 = p_6 = \frac{1}{6}.$$

Note that the asserted probabilities sum to 1, a general principle in goodness-of-fit testing. Sample data would consist of $n$ rolls of a die. Assuming the null hypothesis holds, we would expect to see $n \cdot \frac{1}{6}$ instances of rolls which are 1, $n/6$ instances of twos, and so on. That is, the expected count of each different value would be $n/6$.

Example: In Mendelian genetics, the law of independent assortment asserts that, for dihybrid crosses, combinations of two traits will occur with frequencies in a 9:3:3:1 ratio. A test to see if this Mendelian model applies in the case of two traits would have null hypothesis

$$H_0: p_1 = \frac{9}{16}, \quad p_2 = \frac{3}{16}, \quad p_3 = \frac{3}{16}, \quad p_4 = \frac{1}{16}.$$

Mendel's work was on peas, and the two traits often used in descriptions of this work are color (Yellow vs. green) and texture (Smooth vs. wrinkled). If the model asserted in the null hypothesis holds then, when observing $n$ peas, one would expect to see $\frac{9}{16}n$ which are Yellow and Smooth, $\frac{3}{16}n$ which are Yellow and wrinkled, $\frac{3}{16}n$ which are green and Smooth, and $\frac{1}{16}n$ which are green and wrinkled. The sample size and null hypothesis together give us expected counts $E_i$.
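The expected counts in the two examples above can be sketched in base R as follows. The sample sizes (120 rolls, 160 peas) and all variable names are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Hypothetical sample sizes, chosen only for illustration
n_rolls <- 120
n_peas  <- 160

# Probabilities asserted by each null hypothesis
die_probs    <- rep(1, 6) / 6
mendel_probs <- c(YellowSmooth = 9, YellowWrinkled = 3,
                  GreenSmooth = 3, GreenWrinkled = 1) / 16

# The asserted probabilities must sum to 1
stopifnot(isTRUE(all.equal(sum(die_probs), 1)),
          isTRUE(all.equal(sum(mendel_probs), 1)))

# Expected counts: E_i = n * p_i
die_expected    <- n_rolls * die_probs
mendel_expected <- n_peas * mendel_probs

print(die_expected)     # 20 for each of the six faces
print(mendel_expected)  # 90, 30, 30, 10
```

The only inputs are the sample size and the hypothesized proportions; no data are needed to obtain expected counts.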
The data collected gives us observed counts $O_i$ by way of a frequency table. We use the chi-square (or $\chi^2$) statistic

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

as an overall measure of the discrepancy between the frequencies we expected and what we actually observed. This number, which cannot be negative, is zero when observed frequencies match expected ones exactly, but grows as observed counts become increasingly different from expected ones.

It would be convenient to have a function in RStudio to calculate chi-square statistics for us, much as the mean() and sd() functions calculate $\bar{x}$ and $s$ from sample data. The initiation cell below, which you should execute, provides two commands for calculating $\chi^2$. There are two versions: (), which is useful when we have the raw data (i.e., one value of the categorical variable for every case), and ftchisqstat(), useful when we have a frequency table of observed counts in the sample.

    <- function(datavector, probs) {
      sampsize = length(datavector)
      observedcounts <- tally(datavector)
      expectedcounts <- probs * sampsize
      chisqstat <- sum( (observedcounts - expectedcounts)^2 / expectedcounts )
      return ( chisqstat )
    }

    ftchisqstat <- function(obscounts, probs) {
      sampsize <- sum(obscounts)
      expectedcounts <- probs * sampsize
      chisqstat <- sum( (obscounts - expectedcounts)^2 / expectedcounts )
      return ( chisqstat )
    }

Goodness-of-fit via randomization sampling

An example involving answers for multiple choice tests

Now that we have the calculation that leads to a test statistic, we turn our attention to producing a P-value. Chapter 4 showed us how to produce randomization distributions for other hypothesis test settings. Here we consider how we might produce a simulated distribution for $\chi^2$ values under the null hypothesis. The key is to draw sample data like ours (i.e., with the exact same sample size) with replacement from a bag that is stocked according to the specifications in the null hypothesis.

Consider the Lock data set APMultipleChoice. This is raw data, where each row gives the letter which was the correct answer of a particular multiple choice question. The values of the Answer variable are gathered and stored in optionslist using the command

    optionslist <- names(tally(APMultipleChoice$Answer))

We look at optionslist

    optionslist
    ## [1] "A" "B" "C" "D" "E"

and see that the questions had 5 possible letters. The observed counts, telling us how often each letter was the correct answer, appear in the frequency table:

    tally(APMultipleChoice$Answer)
    ## X
    ## A B C D E

A total of

    sum(tally(APMultipleChoice$Answer))
    ## [1] 400

multiple choice questions appear in our sample. A teacher might consider a good multiple choice test to be one in which each letter is equally likely to be the correct one. This good test model naturally gives rise to the null hypothesis

$$H_0: p_A = p_B = p_C = p_D = p_E = \frac{1}{5}.$$

This yields a list of expected frequencies

    hypprobs = rep(1,5) / 5
    400 * hypprobs
    ## [1] 80 80 80 80 80
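As a sketch of how a $\chi^2$ statistic is assembled from expected frequencies of 80 per letter, the base R code below computes each category's contribution and then sums them. The observed counts are an assumption made for illustration (they are not taken from the frequency table in the text), and the names `obs_counts`, `hyp_probs`, `contrib` and `chisq_stat` are hypothetical.

```r
# Assumed observed counts for the 400 answers (illustrative only)
obs_counts <- c(A = 85, B = 90, C = 79, D = 78, E = 68)
hyp_probs  <- rep(1, 5) / 5

exp_counts <- sum(obs_counts) * hyp_probs            # 80 for each letter
contrib    <- (obs_counts - exp_counts)^2 / exp_counts
chisq_stat <- sum(contrib)

print(round(contrib, 4))
print(chisq_stat)   # 3.425

# Cross-check against base R's built-in goodness-of-fit test
builtin <- chisq.test(obs_counts, p = hyp_probs)
stopifnot(isTRUE(all.equal(unname(builtin$statistic), chisq_stat)))
```

Looking at the individual contributions shows which categories drive the statistic. With raw data, counts could be obtained with `table(factor(x, levels = ...))` so that a category with no observations still contributes a count of zero.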

The corresponding test ($\chi^2$) statistic is computed by the () command defined above:

    myteststat <- (APMultipleChoice$Answer, hypprobs)
    myteststat
    ## [1] 3.425

A randomization sample (which takes our null hypothesis and sample size into account) could be produced using the command

    resample(optionslist, size=400, prob=hypprobs)

The corresponding randomization statistic would be the $\chi^2$ statistic of this randomization sample

    (resample(optionslist, size=400, prob=hypprobs), hypprobs)
    ## [1]

To get a randomization distribution, we want to gather many randomization statistics

    manycsstats <- do(1000) * (resample(optionslist, size=400, prob=hypprobs), hypprobs)
    head(manycsstats)
    histogram(~, data=manycsstats, groups=>=myteststat)

Since $\chi^2$ statistics get larger as the observed counts differ more and more from the expected ones, a goodness-of-fit test is always a 1-tailed (upper-tailed) test, one concerned with the area (probability) corresponding to values as high or higher than the test statistic ($\chi^2$ from the actual data). So the approximate P-value is the proportion of randomization statistics that are at least as large as the test statistic.
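The randomization procedure described above can also be sketched in base R, without the mosaic helpers. This is a minimal illustration under stated assumptions: the observed counts are assumed (they reproduce the statistic 3.425 that the text later passes to pchisq()), the seed is arbitrary, and the helper names `chisq_from_counts` and `one_randomization_stat` are hypothetical.

```r
set.seed(2016)  # arbitrary seed so the sketch is reproducible

options_list <- c("A", "B", "C", "D", "E")
hyp_probs    <- rep(1, 5) / 5

# Assumed observed counts (illustrative); they give chi-square = 3.425
obs_counts <- c(A = 85, B = 90, C = 79, D = 78, E = 68)
n <- sum(obs_counts)

# Chi-square statistic from a vector of counts and hypothesized proportions
chisq_from_counts <- function(counts, probs) {
  expected <- probs * sum(counts)
  sum((counts - expected)^2 / expected)
}

obs_stat <- chisq_from_counts(obs_counts, hyp_probs)

# One randomization statistic: draw n letters with replacement under H0,
# tabulate them (keeping all five levels), and compute chi-square
one_randomization_stat <- function() {
  draw   <- sample(options_list, size = n, replace = TRUE, prob = hyp_probs)
  counts <- table(factor(draw, levels = options_list))
  chisq_from_counts(as.numeric(counts), hyp_probs)
}

rand_stats <- replicate(1000, one_randomization_stat())

# Upper-tail proportion: the approximate P-value
p_value <- mean(rand_stats >= obs_stat)
print(p_value)
```

Using `factor(draw, levels = options_list)` keeps a zero count for any letter that happens not to appear in a simulated sample, so the count vector always lines up with the hypothesized proportions.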

For the randomization distribution stored in manycsstats, this proportion is computed by

    nrow(subset(manycsstats, >= myteststat)) / 1000  # gives approx. P-value
    ## [1] 0.49

An example involving the breakdown of days on which babies are born

There is another package, abd, one can load to gain access to the DayOfBirth data frame.

    require(abd)
    ## Loading required package: abd
    ## Loading required package: nlme
    ##
    ## Attaching package: nlme
    ## The following object is masked from package:dplyr :
    ##
    ## collapse
    ## Loading required package: grid

    DayOfBirth
    ##         day births
    ## 1    Sunday     33
    ## 2    Monday     41
    ## 3   Tuesday     63
    ## 4 Wednesday     63
    ## 5  Thursday     47
    ## 6    Friday     56
    ## 7  Saturday     47

This is not the raw data, which would have contained one row per birth. The variable in question is the day on which the birth occurred. The data comes to us already summarized in a frequency table. A reasonable null hypothesis is that it is equally likely that a birth would fall on any of the days of the week:

$$H_0: p_{Su} = p_M = p_{Tu} = p_W = p_{Th} = p_F = p_{Sa} = \frac{1}{7}.$$

We can use the other chi-square computing function defined above to obtain a test statistic:

    hypprobs = rep(1, 7) / 7
    ftchisqstat(DayOfBirth$births, hypprobs)
    ## [1] 15.24

Since there are 350 births in the sample, and the values (Sunday through Saturday) are stored in the day column, we may produce a randomization distribution via

    manychisqs = do(1000) * (resample(DayOfBirth$day, prob=rep(1,7)/7, size=350), hypprobs)
    histogram(~, data=manychisqs)

    nrow(subset(manychisqs, >= 15.24)) / 1000  # gives approx. P-value
    ## [1]

This P-value is small enough to be significant at the 5% level. In rejecting $H_0$, what we can say is that for at least one of the days, the likelihood that a birth occurs on that day is something other than 1/7.

Obtaining a P-value using a chi-square distribution

We should still have randomization distributions stored in manycsstats (good multiple choice test model) and in manychisqs (births equally likely on all days of the week model). When we viewed these distributions above, they did not look normal. They do, however, have shapes which are well-approximated by density curves that come from a known distributional family: the chi-square distributions. Specifically, the null distribution (for $\chi^2$ statistics) in the case of good multiple choice tests with optional answers A through E follows closely a $\chi^2$ distribution with 4 degrees of freedom:

    histogram(~, data=manycsstats)
    plotDist("chisq", df=4, add=TRUE)
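The closeness of this match can also be checked numerically rather than by eye. The sketch below is an illustration, not the handout's method: it swaps in base R's `rmultinom()` to generate simulated frequency tables directly (equivalent to drawing 400 letters with replacement and tabulating them), and compares a few simulated quantiles with those of the chi-square distribution with 4 degrees of freedom. The seed, the number of simulations and all variable names are assumptions.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

k <- 5                      # number of answer letters
n <- 400                    # sample size
hyp_probs <- rep(1, k) / k
expected  <- n * hyp_probs

# Each column of sim_counts is one simulated frequency table under H0
sim_counts <- rmultinom(5000, size = n, prob = hyp_probs)

# Chi-square statistic for every column (expected recycles down the rows)
sim_stats <- colSums((sim_counts - expected)^2 / expected)

probs <- c(0.50, 0.90, 0.95)
comparison <- rbind(
  simulated = quantile(sim_stats, probs, names = FALSE),
  chisq_df4 = qchisq(probs, df = k - 1)
)
colnames(comparison) <- paste0(100 * probs, "%")
print(round(comparison, 2))
```

If the two rows agree closely, upper-tail areas taken from the chi-square curve should be close to the proportions found in the simulated distribution.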

Because of this close match, we could have obtained the approximate P-value via the command

    1 - pchisq(3.425, df=4)
    ## [1]

instead of obtaining it from a randomization distribution. Similarly, the null distribution for $\chi^2$ statistics in the births example is a close match to the chi-square distribution with df = 6:

    histogram(~, data=manychisqs)
    plotDist("chisq", df=6, add=TRUE)

We obtain the approximate P-value, again an upper-tail area, via

    1 - pchisq(15.24, df=6)
    ## [1]

Some general remarks:

• When our (single) categorical variable has 5 different possible values, the number of degrees of freedom in the approximating chi-square distribution is 4. When it is a categorical variable with 7 values, we have df = 6. The best approximating chi-square distribution is the one with 1 fewer df than the number of values/categories seen in the categorical variable.

• Even though the previous bullet point indicates the best chi-square distribution (the best choice of df) to use in approximating P-values, it is not necessarily the case that a chi-square distribution gives good approximations.

• As in the past, there is a rule of thumb (more than one of them, in fact). The Locks indicate that, if you inspect the expected counts and find them all to be at least 5, then you are assured P-values obtained from a chi-square distribution (i.e., using the pchisq() command) give reasonably acceptable approximations.
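The remarks above can be combined into one small helper. The following base R sketch is a hypothetical function (`gof_pvalue` and its `min_expected` argument are not part of the handout): it checks the expected-count rule of thumb, uses one fewer degree of freedom than the number of categories, and returns the upper-tail chi-square P-value. It is applied to the birth counts from the DayOfBirth table.

```r
# Hypothetical helper: chi-square goodness-of-fit P-value from a frequency table
gof_pvalue <- function(obs_counts, probs, min_expected = 5) {
  stopifnot(length(obs_counts) == length(probs),
            isTRUE(all.equal(sum(probs), 1)))

  expected <- sum(obs_counts) * probs

  # Rule of thumb: all expected counts should be at least 5
  if (any(expected < min_expected)) {
    warning("some expected counts are below ", min_expected,
            "; the chi-square approximation may be poor")
  }

  stat <- sum((obs_counts - expected)^2 / expected)
  df   <- length(obs_counts) - 1          # one fewer than the number of categories

  list(statistic = stat,
       df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))  # upper-tail area
}

# Birth counts, Sunday through Saturday, from the DayOfBirth table
births <- c(33, 41, 63, 63, 47, 56, 47)
result <- gof_pvalue(births, rep(1, 7) / 7)
print(result)
```

Here `pchisq(..., lower.tail = FALSE)` gives the same upper-tail area as `1 - pchisq(...)`. Base R's `chisq.test(births, p = rep(1, 7) / 7)` reports the same statistic, degrees of freedom and P-value, and can serve as a cross-check.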

7.2: Chi-Square Test for Association T.Scofield Nov. 17, 2016

The goal of this section is to provide means for investigating whether there is an association between two categorical variables. Before proceeding,