7.2: Chi-Square Test for Association T.Scofield Nov. 17, 2016



The goal of this section is to provide means for investigating whether there is an association between two categorical variables. Before proceeding, it may be helpful to enter the following commands, which will add new commands called expectedcounts() and chisqstat() to our list of available ones.

    # Expected cell counts under independence, given a two-way table of counts
    expectedcounts <- function(sTable) {
      casetotal = sum(sTable)
      expcounts = sTable
      for (ii in 1:nrow(sTable)) {
        for (jj in 1:ncol(sTable)) {
          # (row total) * (column total) / (sample size)
          expcounts[ii,jj] = sum(sTable[ii,]) * sum(sTable[,jj]) / casetotal
        }
      }
      return (expcounts)
    }

    # Chi-square statistic comparing observed counts with expected counts
    chisqstat <- function(oTab) {
      X2 = 0
      eTab <- expectedcounts(oTab)
      for (ii in 1:nrow(oTab)) {
        for (jj in 1:ncol(oTab)) {
          X2 = X2 + (oTab[ii,jj] - eTab[ii,jj])^2 / eTab[ii,jj]
        }
      }
      return (X2)
    }

Example: Is there an association between socioeconomic status (SES) and smoking? We will address this question using a hypothesis test. The null and alternative hypotheses are stated this way:

    $H_0$: no association exists between SES and smoking status
    $H_a$: there is an association between SES and smoking status

One can imagine the data one would collect for a study to address this question. For each person sampled, we would record the values of two categorical variables: SES and smoking status.

    case   SES      Smoking Status
    1      middle   current
    2      low      current
    3      high     never
    4      middle   former

But perhaps the data comes to us not in this raw form, but already summarized as a two-way (contingency) table.
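The step from raw cases to a two-way table can be sketched in base R. This is only an illustration: the data frame name `cases` and its column names are assumptions, and it holds just the four example cases listed above rather than a full data set.

```r
# Hypothetical data frame containing only the four example cases shown above.
cases <- data.frame(
  SES     = c("middle", "low", "high", "middle"),
  Smoking = c("current", "current", "never", "former")
)

# table() cross-classifies the two categorical variables,
# giving one cell per (SES, Smoking) combination.
rawtable <- table(SES = cases$SES, Smoking = cases$Smoking)
rawtable
```

With a real data set, the same call would count every sampled person, and the resulting table could be passed to any command that expects a two-way table.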

For this example, the table of counts is:

           current former never
    high        51     92    68
    low         43     28    22
    middle      22     21     9

It is convenient to know steps one can use to build such a table inside RStudio directly from the numbers (i.e., when no raw data set is available). These commands will do so, giving it the name smoketable.

    smoke <- matrix(c(51,92,68,43,28,22,22,21,9), ncol=3, byrow=TRUE)
    colnames(smoke) <- c("current","former","never")
    rownames(smoke) <- c("high","low","middle")
    smoketable <- as.table(smoke)

If we wish to view the table at this point, we may do so by typing the name under which it is stored:

    smoketable
    ##        current former never
    ## high        51     92    68
    ## low         43     28    22
    ## middle      22     21     9

Because both of the categorical variables have 3 values, the table itself has 9 cells. We might wish to see row and column totals as well, which is achieved by wrapping the table's name in an addmargins() command.

    addmargins(smoketable)
    ##        current former never Sum
    ## high        51     92    68 211
    ## low         43     28    22  93
    ## middle      22     21     9  52
    ## Sum        116    141    99 356

As with goodness-of-fit testing, our test statistic will be

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

The $O_i$ here are the numbers found in the various cells of the two-way table, while the $E_i$ represent the numbers we would have expected in these cells under the null hypothesis. The expectedcounts() command we built above can help us with the $E_i$. Both of the commands created there have limited use. They both require a two-way table as input, which means they are not appropriate for the goodness-of-fit tests we have previously discussed. In our present context, however, we can process expected counts for our smoketable.

    expectedcounts(smoketable)

Once again, we can wrap this in addmargins() to see row/column totals (values shown here rounded to two decimal places).

    addmargins(expectedcounts(smoketable))
    ##        current former never    Sum
    ## high     68.75  83.57 58.68 211.00
    ## low      30.30  36.83 25.86  93.00
    ## middle   16.94  20.60 14.46  52.00
    ## Sum     116.00 141.00 99.00 356.00

Take a moment and find the two-way table containing observed counts above. Note that row and column totals are unchanged between that table and this one containing expected counts. It is the contents of the individual cells which have been altered. In the table of expected counts, for instance, we have that for each SES status, the proportion of current smokers is the same:

- among high SES cases, the proportion is 68.75/211 = 0.326
- among low SES cases, the proportion is 30.3/93 = 0.326
- among middle SES cases, the proportion is 16.94/52 = 0.326

These sorts of proportions/ratios are consistent whichever way you look at them, either across rows or columns. As another instance, if we focus on proportions of low SES across the various smoking statuses, we see

- among current smokers, the proportion is 30.3/116 = 0.261
- among former smokers, the proportion is 36.83/141 = 0.261
- among those who never smoked, the proportion is 25.86/99 = 0.261

This is how things would be expected to look in a population perfectly represented by our sample when the two variables are independent (not associated).

While software has provided us with the expected counts, the method for calculating each is straightforward, given that the row and column totals are identical to those of the actual data:

$$E_i = \frac{(\text{row total}) \times (\text{column total})}{\text{sample size}}$$

For example, the expected count for the high SES, current smoker cell is (211)(116)/356 = 68.75.

The chisqstat() function we defined above can compute for us the $\chi^2$ test statistic (though you should practice doing this by hand and see that you obtain the same value). As with our expectedcounts() command, it requires, as input, the two-way table.

    chisqstat(smoketable)

The result is approximately 18.51.

Assuming our expected counts are all at least 5 (same rule of thumb as the Locks gave us for goodness-of-fit testing), we can obtain an approximate P-value from a chi-square distribution with degrees of freedom given by

$$df = [(\text{number of rows}) - 1] \times [(\text{number of columns}) - 1]$$

In this case, our smallest expected count is 14.46, so the rule of thumb is met. We take df = (3 - 1)(3 - 1) = 4, and compute the approximate P-value (recalling that this is a 1-sided, right-tailed test):

    1 - pchisq(18.51, df=4)

The resulting P-value is roughly 0.001. We would reject the null hypothesis at the 5% level (also the 1% level), here, and conclude there is an association between SES and smoking status.

Example: This one provides a variation on Example 7.10, p. 481 in the text. We have access to the raw data for this example, which is found in the data frame WaterTaste. We obtain a two-way table of observed frequencies, along with one containing the corresponding expected counts, for the categorical variables in question, UsuallyDrink and First:

    # tally() comes from the mosaic package, assumed to be loaded already
    fulltable <- tally(~UsuallyDrink + First, data=WaterTaste)
    fulltable
    expectedcounts(fulltable)

Each of these tables has rows Bottled, Filtered, and Tap (the values of UsuallyDrink) and columns Aquafina, Fiji, SamsChoice, and Tap (the values of First).

We see the rule of thumb that all expected counts be at least 5 is not met. Many authors state a different rule of thumb, saying you are safe to use a chi-square distribution to approximate the P-value if no expected count is less than 1, and no more than 20% of expected counts are less than 5, but even this relaxed rule of thumb is not met. At this stage we have two options:

- We could combine values of a variable. This is the approach taken in the text, where they have combined the Filtered and Tap rows into a single category they call "Tap/Filtered". The new expected counts after combining these rows are displayed in parentheses in Table 7.23 on p. 482, and while there is still one of the 8 cells that contains an expected count smaller than 5, the relaxed rule of thumb is satisfied.
- If we really do not wish to combine categories, we might use randomization to produce an approximate null distribution (distribution of $\chi^2$ values from randomization samples) and obtain a P-value from it. This approach is not overly difficult in this instance, particularly because we have the raw data set. We adopt this approach below.

The command

    tally(~shuffle(UsuallyDrink) + shuffle(First), data=WaterTaste)

produces a randomization sample. Notice that if we execute this command and include row/column totals, these totals are maintained even though the simulated observed counts found in the individual cells change.

    addmargins(tally(~shuffle(UsuallyDrink) + shuffle(First), data=WaterTaste))

We obtain an individual randomization ($\chi^2$) statistic by generating a randomization sample and wrapping that inside a call to the chisqstat() function:

    chisqstat(tally(~shuffle(UsuallyDrink) + shuffle(First), data=WaterTaste))

It is this that we would want to repeat often in order to generate a randomization distribution.

    manychisqs = do(1000) * chisqstat(tally(~shuffle(UsuallyDrink) + shuffle(First), data=WaterTaste))
    head(manychisqs)

The result is a data frame whose single column, chisqstat, holds the 1000 randomization statistics. Our test statistic is obtained similarly, but without shuffling:

    chisqstat(tally(~UsuallyDrink + First, data=WaterTaste))

Its value is approximately 4.97.
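For readers without the mosaic package at hand, the shuffle-and-recompute idea can be sketched in base R. The sketch below is an illustration under stated assumptions, not the method used in these notes: it reuses the SES and smoking counts from the first example (so that it runs without the WaterTaste data), the helper name `x2` and the seed are arbitrary choices, and it shuffles only one of the two variables, which is enough to break any association between them.

```r
# Observed counts from the SES/smoking example earlier in this section.
smoke <- matrix(c(51, 92, 68, 43, 28, 22, 22, 21, 9), ncol = 3, byrow = TRUE,
                dimnames = list(SES = c("high", "low", "middle"),
                                Smoking = c("current", "former", "never")))

# Expand the table into raw form: one row per sampled person.
long  <- as.data.frame(as.table(smoke))   # columns: SES, Smoking, Freq
cases <- long[rep(seq_len(nrow(long)), long$Freq), c("SES", "Smoking")]

# Chi-square statistic for a two-way table of counts (hypothetical helper).
x2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

set.seed(1)  # arbitrary seed, used only so the run is reproducible

# Each replicate shuffles Smoking relative to SES and recomputes the statistic;
# row and column totals stay fixed because only the pairing changes.
randstats <- replicate(1000, x2(table(cases$SES, sample(cases$Smoking))))

observed <- x2(table(cases$SES, cases$Smoking))
pvalue   <- mean(randstats >= observed)   # proportion at least as extreme
pvalue
```

The last line estimates the P-value as the proportion of randomization statistics that are at least as large as the observed one, which is the same calculation applied to manychisqs in this section.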

We view the randomization distribution stored in manychisqs, shading the region to the right of our test statistic, and compute the approximate P-value. As the two-way table has 3 rows and 4 columns, the chi-square distribution which best approximates the null distribution is the one with df = (2)(3) = 6. We overlay this distribution to illustrate how similar (or not) it is to the randomization distribution. Since the rule of thumb for using a chi-square distribution is not met, the implication is that these two (the randomization distribution and the chi-square density curve) are not similar enough to warrant using the pchisq() command to obtain a P-value.

    histogram(~chisqstat, data=manychisqs, groups = chisqstat>=4.97)
    plotDist("chisq", df=6, add=TRUE)

[Figure: histogram of the randomization values of chisqstat (vertical axis: Density), with the chi-square density curve for df = 6 overlaid.]

    nrow(subset(manychisqs, chisqstat >= 4.97)) / 1000
    ## [1] 0.553

With this high P-value, we fail to reject the null hypothesis.
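As a cross-check on the first example (SES and smoking), base R's chisq.test() carries out the same chi-square calculation in one call. This is a sketch for comparison only; the notes themselves build the statistic with chisqstat() and pchisq(), and the object name `result` is an arbitrary choice.

```r
# Observed counts from the SES/smoking example.
smoke <- matrix(c(51, 92, 68, 43, 28, 22, 22, 21, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("current", "former", "never")
rownames(smoke) <- c("high", "low", "middle")

# Pearson chi-square test of association on the two-way table.
result <- chisq.test(smoke)

result$statistic   # chi-square test statistic
result$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
result$expected    # expected counts under the null hypothesis
result$p.value     # right-tail P-value from the chi-square distribution
```

The components of `result` should agree with the values obtained step by step from expectedcounts(), chisqstat(), and pchisq(), which makes this a convenient way to check hand calculations.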

Goodness-of-Fit Testing T.Scofield Nov. 16, 2016

We do goodness-of-fit testing with a single categorical variable, to see if the distribution of its sampled values fits a specified probability model. The