Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

In the previous lecture we learned how to incorporate a categorical research factor into an MLR model by using dummy variables. Given a categorical factor with g levels, we construct (g-1) dummy variables as defined by the following coding table:

    Category   C_1   C_2   C_3   ...   C_(g-1)
    1           1     0     0    ...     0
    2           0     1     0    ...     0
    3           0     0     1    ...     0
    ...
    (g-1)       0     0     0    ...     1
    g           0     0     0    ...     0

The coding table is used to assign values of the dummy variables to each individual. The dummy variables are then used as IVs in a regression model, which produces a value of R^2 as well as a regression equation. The value of R^2 indicates the proportion of variance in Y accounted for by the categorical research factor, as represented by the dummy variables.
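As a small illustration of the general dummy coding table, the sketch below builds it for any number of categories g. The function name dummy_coding_table is hypothetical (it is not part of the lecture), and NumPy is assumed to be available.

```python
import numpy as np

def dummy_coding_table(g):
    """Return a g x (g-1) dummy coding table; row j-1 holds the codes for category j."""
    # An identity matrix with its last column dropped: category j gets a 1 on C_j,
    # and category g (the last row) gets 0 on every dummy variable.
    return np.eye(g, dtype=int)[:, : g - 1]

# Example with g = 4 categories, giving 3 dummy variables.
table = dummy_coding_table(4)
print(table)
```

Each row of the result is the set of codes assigned to every individual in that category.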

The partial statistics associated with dummy variable C_j are interpreted with reference to a comparison of category j to category g. Thus, category g plays a special role by serving as the category to which all others are compared via the partial statistics. When there is no basis for assigning a particular category to play this role, we may wish to use a different coding method.

Unweighted Effects Coding

Effects coded variables look very much like dummy variables, with one change: individuals in group g are assigned values of -1 on all effects coded variables. Thus, the coding table would have the following general form:

    Category   C_1   C_2   C_3   ...   C_(g-1)
    1           1     0     0    ...     0
    2           0     1     0    ...     0
    3           0     0     1    ...     0
    ...
    (g-1)       0     0     0    ...     1
    g          -1    -1    -1    ...    -1

In our example the coding table would look like this:

    Category       C_1   C_2   C_3
    1 (Drug A)      1     0     0
    2 (Drug B)      0     1     0
    3 (Placebo)     0     0     1
    4 (Control)    -1    -1    -1

Using this coding table we could then assign values of the effects coded IVs to each individual and produce a data matrix of the form

    Participant   Treatment   C_1   C_2   C_3    Y
    1             A            1     0     0     9
    2             A            1     0     0    10
    3             B            0     1     0     8
    4             B            0     1     0     7
    5             Placebo      0     0     1     5
    6             Placebo      0     0     1     8
    7             Control     -1    -1    -1     7
    ...
    n

We could then use the effects coded variables as IVs in an MLR analysis with Y as the DV. The analysis would produce a value of R^2 along with a regression equation of the form

    \hat{Y} = B_0 + B_1 C_1 + B_2 C_2 + \cdots + B_{g-1} C_{g-1}
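The step of going from the coding table to the data matrix can be sketched in code. This is only an illustration: it uses the seven rows displayed in the data matrix (the rows up to participant n are not shown in the lecture), and the names EFFECTS_CODES, treatment, and C are hypothetical.

```python
import numpy as np

# Coding table from the drug example: codes on (C1, C2, C3) for each treatment.
EFFECTS_CODES = {
    "A": (1, 0, 0),
    "B": (0, 1, 0),
    "Placebo": (0, 0, 1),
    "Control": (-1, -1, -1),
}

# Illustrative subset: only the seven participants displayed in the data matrix.
treatment = ["A", "A", "B", "B", "Placebo", "Placebo", "Control"]
y = np.array([9, 10, 8, 7, 5, 8, 7], dtype=float)

# Look up each participant's codes to build the n x (g-1) matrix of coded IVs.
C = np.array([EFFECTS_CODES[t] for t in treatment], dtype=float)
print(C)
```

The rows of C, together with y, correspond to the columns C_1, C_2, C_3, and Y of the data matrix.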

The value of R^2 and all inferential information about R^2 (significance test, confidence interval, correction for shrinkage, etc.) would be identical to results obtained from the MLR analysis with dummy variables. The regression coefficients and other partial statistics associated with the effects coded variables would be different than corresponding information associated with dummy variables.

It can be shown that the intercept and coefficients in the regression equation would have the following interpretation. The intercept B_0 would be equal to the mean of the g group means on the Y variable. That is:

    B_0 = \bar{\bar{Y}} = \frac{\bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3 + \cdots + \bar{Y}_g}{g}

This value is called the unweighted mean of the group means, meaning that the group means are not weighted by sample size. If group sample sizes are equal, then this value is equivalent to the grand mean of Y across all n observations. More on this later.

The regression coefficients for effects coded IVs also have a simple interpretation. For the first coded variable, it can be shown that

    B_1 = \bar{Y}_1 - \bar{\bar{Y}}

That is, the regression coefficient for C_1 will equal the difference between the mean for category 1 and the mean of all group means. In our example B_1 would equal the difference between the mean value of Y for individuals in the Drug A condition and the mean of all four group means. Such coefficients can be thought of as representing the effect of membership in a given category. For example, a large positive value of B_1 indicates a strong positive effect of being in the Drug A condition.

Each regression coefficient for effects coded IVs has a similar interpretation. The coefficient for C_2 would have the value

    B_2 = \bar{Y}_2 - \bar{\bar{Y}}

and would reflect the effect of being in the Drug B condition. In general, for effects coded IV C_j,

    B_j = \bar{Y}_j - \bar{\bar{Y}}

reflects the effect of membership in category j. For all such effects, category j is compared to the unweighted mean of all categories.
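The interpretation of the intercept and coefficients can be checked numerically. The sketch below is an illustration only: it fits the regression by ordinary least squares with NumPy, using just the seven rows displayed in the example data matrix, and all variable names are hypothetical.

```python
import numpy as np

# Effects coding table from the drug example: codes on (C1, C2, C3).
codes = {"A": (1, 0, 0), "B": (0, 1, 0), "Placebo": (0, 0, 1), "Control": (-1, -1, -1)}

# Illustrative subset: only the seven rows shown in the example data matrix.
treatment = ["A", "A", "B", "B", "Placebo", "Placebo", "Control"]
labels = np.array(treatment)
y = np.array([9, 10, 8, 7, 5, 8, 7], dtype=float)

# Design matrix: a column of ones for the intercept, then the coded IVs.
C = np.array([codes[t] for t in treatment], dtype=float)
X = np.column_stack([np.ones(len(y)), C])

# Ordinary least squares estimates (B0, B1, B2, B3).
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means and their unweighted mean, computed directly from the data.
group_means = {g: y[labels == g].mean() for g in codes}
unweighted_mean = np.mean(list(group_means.values()))

print("B0 =", b[0], "| unweighted mean of group means =", unweighted_mean)
for j, g in enumerate(["A", "B", "Placebo"], start=1):
    print(f"B{j} =", b[j], "| group mean minus unweighted mean =", group_means[g] - unweighted_mean)
```

Because the groups in this subset have unequal sizes (Control has one observation), the unweighted mean of the group means differs from the grand mean of Y, and the intercept matches the former.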

For each B_j we can also conduct significance tests and obtain confidence intervals. Such information is interpreted with reference to a comparison of the mean for category j to the unweighted mean of all group means.

Similarly, we can obtain a value of sr^2_j associated with each effects coded variable. Such a value would be interpreted as the proportion of variance in Y accounted for by the effect of membership in group j; or more specifically, the proportion of variance in Y accounted for by the difference between the mean for group j and the unweighted mean of all group means.

In general, under this type of coding, all partial statistics are interpreted with reference to comparison of a given group to the unweighted mean of all group means. Note the difference between this interpretation and that for partial statistics associated with dummy variables, which are interpreted with reference to comparison of a given group to group g.

Note that the use of unweighted effects coding implies that each category counts equally. Differences in sample sizes for different groups are not considered relevant. This would normally be the case in experimental designs.
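One way to obtain sr^2_j is as the drop in R^2 when C_j is removed from the full model. The sketch below assumes that standard definition of the squared semipartial correlation; it uses only the seven rows displayed in the example data matrix, and the helper r_squared is hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on the columns of X (X must include the intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Effects coded IVs for the seven displayed participants (A, A, B, B, Placebo, Placebo, Control).
C = np.array([
    [1, 0, 0], [1, 0, 0],
    [0, 1, 0], [0, 1, 0],
    [0, 0, 1], [0, 0, 1],
    [-1, -1, -1],
], dtype=float)
y = np.array([9, 10, 8, 7, 5, 8, 7], dtype=float)
ones = np.ones(len(y))

# R^2 for the full model containing all (g-1) coded variables.
r2_full = r_squared(np.column_stack([ones, C]), y)

# sr^2_j = R^2(full) - R^2(model with C_j removed).
sr2 = {}
for j in range(C.shape[1]):
    X_reduced = np.column_stack([ones, np.delete(C, j, axis=1)])
    sr2[f"C{j + 1}"] = r2_full - r_squared(X_reduced, y)

print("R^2 (full model):", r2_full)
print("sr^2 by coded variable:", sr2)
```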

Weighted Effects Coding

In some situations differences in sample sizes among groups may be indicative of those groups representing different proportions of the full population. For example, if the research factor is ethnicity and we take a large sample from the full population, we will find different sample sizes for different ethnic groups. Those sample sizes reflect the fact that different ethnic groups make up different proportions of the full population.

If we wish for these differences to be represented in our coded variables and in our regression analyses, then effects codes must be adjusted by using the differential sample sizes. See details for these adjustments in Cohen, Cohen, West, & Aiken (2004). The resulting coded variables can then be used as IVs in a regression analysis, producing a value of R^2 and a regression equation. The value of R^2 and associated inferential information will be identical to that obtained using dummy coding or unweighted effects coding.

The coefficients in the regression equation will be different and will be interpreted in terms of weighted means instead of unweighted means. The intercept B_0 will correspond to the weighted mean of all g group means. A regression coefficient B_j will be interpreted as a deviation of a group mean from the weighted mean of all g group means.

The choice of weighted vs. unweighted effects coding depends primarily on whether differences in sample sizes for different categories of the research factor are reflective of those categories representing different proportions of the full population. The choice of effects coding vs. dummy coding depends at least in part on whether there exists an appropriate choice for a comparison group under dummy coding.
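The lecture leaves the weighted adjustment to the cited text. As a rough sketch, the code below assumes the commonly described form of weighted effects codes, in which group g receives -n_j / n_g on C_j instead of -1 (n_j being the sample size of group j). That form is an assumption here, not something stated in the lecture, and the data are only the seven rows displayed in the earlier example.

```python
import numpy as np

# Illustrative subset: only the seven rows shown in the example data matrix.
treatment = ["A", "A", "B", "B", "Placebo", "Placebo", "Control"]
labels = np.array(treatment)
y = np.array([9, 10, 8, 7, 5, 8, 7], dtype=float)

groups = ["A", "B", "Placebo", "Control"]  # the last entry plays the role of group g
n = {g: int(np.sum(labels == g)) for g in groups}
base = groups[-1]

# ASSUMED weighted effects codes: same as unweighted codes for groups 1..(g-1),
# but group g gets -n_j / n_g on C_j rather than -1.
codes = {}
for j, g in enumerate(groups[:-1]):
    row = [0.0] * (len(groups) - 1)
    row[j] = 1.0
    codes[g] = row
codes[base] = [-n[g] / n[base] for g in groups[:-1]]

# Fit the regression by ordinary least squares.
C = np.array([codes[t] for t in treatment], dtype=float)
X = np.column_stack([np.ones(len(y)), C])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Under this coding the intercept matches the sample-size-weighted mean of the
# group means, which is the grand mean of Y.
print("B0 =", b[0], "| grand mean of Y =", y.mean())
for j, g in enumerate(groups[:-1], start=1):
    print(f"B{j} =", b[j], "| group mean minus grand mean =", y[labels == g].mean() - y.mean())
```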

Contrast Coding

A third type of coding can be used when there exist prior hypotheses about particular differences between categories. In our example, for instance, one specific issue of interest might be evaluation of the difference in effectiveness between Drug A and Drug B, ignoring the other two categories. Another might be evaluation of the difference in effectiveness between use of a real drug (Drug A and Drug B) vs. no real drug (Placebo and Control). Such prior hypotheses are called contrasts, and we can design coded variables to represent and provide for the testing of contrasts of interest. Given g categories we would define (g-1) contrast coded variables.

The general procedure for defining a contrast coded variable is as follows. Given g categories, a contrast can be seen as defining three subsets of the g categories:

    Subset U, containing u categories.
    Subset V, containing v categories.
    Subset W, containing w categories.

The contrast is designed to compare the groups in subset U to those in subset V, ignoring those in subset W.

For example, if we wish to compare Drug A to Drug B, ignoring Placebo and Control conditions, then:

    Subset U is the Drug A condition, containing u=1 category.
    Subset V is the Drug B condition, containing v=1 category.
    Subset W contains the Placebo and Control conditions, thus w=2.

Contrast codes (defining a column in the coding table) are then defined as follows:

    For categories in subset U, codes are set at -v/(u+v).
    For categories in subset V, codes are set at +u/(u+v).
    For categories in subset W, codes are set at 0.

To illustrate, let us define codes for contrast variable C_1 to represent a comparison of Drug A vs. Drug B.

    The value of C_1 for the Drug A condition would be -1/2.
    The value of C_1 for the Drug B condition would be +1/2.
    The value of C_1 for the Placebo and Control conditions would be 0.
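The rule for assigning contrast codes can be written as a short function. This is a sketch only: the function name contrast_codes is hypothetical, and the sign convention follows the worked example, where the subset U category (Drug A) receives the negative code -1/2 and the subset V category (Drug B) receives +1/2.

```python
from fractions import Fraction

def contrast_codes(categories, subset_u, subset_v):
    """Return {category: code} for a contrast of subset U vs. subset V.

    Categories in neither subset form subset W and are coded 0.
    """
    u, v = len(subset_u), len(subset_v)
    codes = {}
    for category in categories:
        if category in subset_u:
            codes[category] = Fraction(-v, u + v)  # subset U: -v/(u+v)
        elif category in subset_v:
            codes[category] = Fraction(u, u + v)   # subset V: +u/(u+v)
        else:
            codes[category] = Fraction(0)          # subset W: ignored by the contrast
    return codes

categories = ["Drug A", "Drug B", "Placebo", "Control"]

# Contrast of Drug A (subset U) vs. Drug B (subset V), ignoring Placebo and Control.
c1 = contrast_codes(categories, {"Drug A"}, {"Drug B"})
for category, code in c1.items():
    print(category, code)
```

Exact fractions are used so the codes print as -1/2, 1/2, and 0 rather than rounded decimals.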

Thus the first column of the coding table would have the following form:

    Category       C_1    C_2    C_3
    1 (Drug A)    -1/2
    2 (Drug B)    +1/2
    3 (Placebo)     0
    4 (Control)     0

The contrast codes actually define a linear combination of group means:

    C_1 = -\frac{1}{2}\bar{Y}_1 + \frac{1}{2}\bar{Y}_2 + (0)\bar{Y}_3 + (0)\bar{Y}_4

Since we need (g-1) coded IVs to carry the information in the categorical research factor, we can (must) define two more contrasts in our example. Let C_2 be defined to represent a comparison of Placebo vs. Control, ignoring Drugs A and B. Let C_3 be defined to represent a comparison of Drugs A and B to Placebo and Control. The full coding table would then take this form:

    Category       C_1    C_2    C_3
    1 (Drug A)    -1/2     0    +1/2
    2 (Drug B)    +1/2     0    +1/2
    3 (Placebo)     0    -1/2   -1/2
    4 (Control)     0    +1/2   -1/2

Note that the contrasts should be defined as independent, or orthogonal. Independence is achieved by defining the contrasts so that the sum of products of codes for a given pair of contrasts is zero. In our example, if we sum the products of the codes in any pair of columns, we get a value of zero.

Once contrast codes are defined we can then use the coding table to assign values of the coded variables to each individual. In our example the resulting data matrix would look like this:

    Participant   Treatment   C_1    C_2    C_3     Y
    1             A           -1/2     0     1/2     9
    2             A           -1/2     0     1/2    10
    3             B            1/2     0     1/2     8
    4             B            1/2     0     1/2     7
    5             Placebo       0    -1/2   -1/2     5
    6             Placebo       0    -1/2   -1/2     8
    7             Control       0     1/2   -1/2     7
    ...
    n

Just as we did using other coding methods, we could then proceed with an MLR analysis regressing Y on the three coded IVs.
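The orthogonality condition is easy to check by machine. The sketch below sums the products of codes for every pair of columns in the example coding table; the variable names are hypothetical.

```python
import numpy as np
from itertools import combinations

# Contrast coding table from the example: rows are Drug A, Drug B, Placebo, Control;
# columns are C1, C2, C3.
table = np.array([
    [-0.5,  0.0,  0.5],
    [ 0.5,  0.0,  0.5],
    [ 0.0, -0.5, -0.5],
    [ 0.0,  0.5, -0.5],
])

# Sum of products of codes for each pair of contrast columns.
sums = {}
for i, j in combinations(range(table.shape[1]), 2):
    sums[(f"C{i + 1}", f"C{j + 1}")] = float(table[:, i] @ table[:, j])

for pair, total in sums.items():
    print(pair, "sum of products:", total)

# The contrasts are orthogonal if every pairwise sum is zero.
print("all pairs orthogonal:", all(abs(total) < 1e-12 for total in sums.values()))
```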

Results would include a value of R^2 and associated inferential information, which would exactly match corresponding results obtained under other coding methods. Results would also include a regression equation of the form

    \hat{Y} = B_0 + B_1 C_1 + B_2 C_2 + \cdots + B_{g-1} C_{g-1}

Our focus in these results is on the regression coefficients and associated inferential information and partial statistics. (It can be shown that the intercept will be equal to the unweighted mean of the g group means.)

In our example, B_1 would equal the value of the contrast defined by C_1; specifically, the difference between the unweighted means for the Drug A and Drug B conditions (Drug B minus Drug A, given the signs of the codes). The significance test for B_1 would be interpreted as a test of the significance of this contrast. The value of sr^2_1 would be interpreted as the proportion of variance in Y accounted for by this contrast.
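These interpretations can be verified on the example. The sketch below is illustrative only: it fits the contrast coded regression by ordinary least squares with NumPy, using just the seven rows displayed in the example data matrix, and the variable names are hypothetical.

```python
import numpy as np

# Contrast coding table from the example: codes on (C1, C2, C3).
codes = {
    "A":       (-0.5,  0.0,  0.5),
    "B":       ( 0.5,  0.0,  0.5),
    "Placebo": ( 0.0, -0.5, -0.5),
    "Control": ( 0.0,  0.5, -0.5),
}

# Illustrative subset: only the seven rows shown in the example data matrix.
treatment = ["A", "A", "B", "B", "Placebo", "Placebo", "Control"]
labels = np.array(treatment)
y = np.array([9, 10, 8, 7, 5, 8, 7], dtype=float)

# Design matrix with an intercept column, then OLS estimates (B0, B1, B2, B3).
C = np.array([codes[t] for t in treatment], dtype=float)
X = np.column_stack([np.ones(len(y)), C])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means computed directly from the data.
m = {g: y[labels == g].mean() for g in codes}

print("B0 =", b[0], "| unweighted mean of group means =", np.mean(list(m.values())))
print("B1 =", b[1], "| Drug B mean minus Drug A mean =", m["B"] - m["A"])
print("B2 =", b[2], "| Control mean minus Placebo mean =", m["Control"] - m["Placebo"])
print("B3 =", b[3], "| mean of drug means minus mean of no-drug means =",
      (m["A"] + m["B"]) / 2 - (m["Placebo"] + m["Control"]) / 2)
```

Because the codes within each contrast differ by exactly one unit between the two subsets, each coefficient equals the full difference between the two sets of means.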

In a similar fashion the partial statistics associated with each contrast coded IV could be interpreted.

General Comments on Coding Methods

In the case of a single categorical research factor, regardless of which coding method is used, results of an MLR analysis will be equivalent to results of a one-way ANOVA.

When different coding methods are used, the value of R^2 and associated inferential information will not change. The values of regression coefficients and other partial statistics will change, as will their interpretation.

In general when using coded variables we should always make use of unstandardized regression coefficients rather than standardized coefficients. Standardization of coded variables makes interpretation more difficult.
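The invariance of R^2 across coding methods, and its link to one-way ANOVA, can be illustrated numerically. The sketch below uses only the seven rows displayed in the example data matrix; the helper r_squared and the other names are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on the columns of X (X must include the intercept column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ b
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

# Three coding tables for the same four categories: codes on (C1, C2, C3).
coding_tables = {
    "dummy":    {"A": (1, 0, 0), "B": (0, 1, 0), "Placebo": (0, 0, 1), "Control": (0, 0, 0)},
    "effects":  {"A": (1, 0, 0), "B": (0, 1, 0), "Placebo": (0, 0, 1), "Control": (-1, -1, -1)},
    "contrast": {"A": (-0.5, 0, 0.5), "B": (0.5, 0, 0.5),
                 "Placebo": (0, -0.5, -0.5), "Control": (0, 0.5, -0.5)},
}

# Illustrative subset: only the seven rows shown in the example data matrix.
treatment = ["A", "A", "B", "B", "Placebo", "Placebo", "Control"]
labels = np.array(treatment)
y = np.array([9, 10, 8, 7, 5, 8, 7], dtype=float)

# R^2 from the regression under each coding method.
r2 = {}
for name, table in coding_tables.items():
    C = np.array([table[t] for t in treatment], dtype=float)
    r2[name] = r_squared(np.column_stack([np.ones(len(y)), C]), y)

# One-way ANOVA counterpart: between-groups sum of squares over total sum of squares.
ss_total = np.sum((y - y.mean()) ** 2)
ss_between = sum(
    np.sum(labels == g) * (y[labels == g].mean() - y.mean()) ** 2
    for g in coding_tables["dummy"]
)
eta_squared = ss_between / ss_total

print(r2)
print("SS_between / SS_total =", eta_squared)
```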

The choice of coding method can be based on the following principles:

Dummy coding: Use when there is one group that logically can serve as a reference group to which all others will be compared through the various partial statistics.

Effects coding: Use when there is no obvious choice for a reference group and no specific contrasts of interest. Use unweighted effects coding when differences in group sample sizes are irrelevant. Use weighted effects coding when differences in group sample sizes reflect differences in proportional representation in the population.

Contrast coding: Use when prior hypotheses lend themselves to the specification of (g-1) independent contrasts.