Loglinear Models for Categorical Data. Michael Friendly

Size: px

Start display at page:

Download "Loglinear Models for Categorical Data. Michael Friendly"

Eileen Bradley
5 years ago
Views:

1 Admit?: Yes Sex: Male Admit?: No Right Eye Grade High 2 3 Unaided distant vision data Brown Hazel Green Blue Sex: Female Low High 2 3 Left Eye Grade Low Black Brown Red Blond Michael Friendly York University, <friendly@yorku.ca> SPIDA 2004

2 outline Lecture Outline Introduction Graphical methods for Categorical Data Loglinear models: Prelimaries Overview of loglinear models Visualizing contingency tables and loglinear models 2 2 tables Larger tables n-way tables Structured tables Ordinal variables Square tables SPIDA Michael Friendly

3 outline Resources SAS macro programs, from SAS System for Statistical Graphics (Friendly, 1991): SAS macro programs, from Visualizing Categorical Data (Friendly, 2000): Workshop notes: SCS Short Course notes, Categorical Data Analysis with Graphics SPIDA Michael Friendly

4 overview2 Graphical Methods for Categorical Data If I can t picture it, I can t understand it. Albert Einstein Getting information from a table is like extracting sunlight from a cucumber. Farquhar & Farquhar, 1891 Tables vs. Graphs Tables are best suited for look-up read off exact numbers Graphs are better for showing patterns, trends, anomalies, making comparisons Visual presentation as communication: what do you want to say? SPIDA Michael Friendly

5 overview2 Visual metaphors Quantitative data: magnitude position along an axis Frequency data: count area (Friendly, 1995) Model: (DeptGender)(Admit) Admit?: Yes Sex: Male Admit?: No A B C D E F Sex: Female Admitted Male Rejected Female Fourfold display for 2 2 table Mosaic plot for 3-way table SPIDA Michael Friendly

6 Auto data: PC2/1 order Displa Gratio Hroom Length MPG Price Rep77 Rep78 Rseat Trunk Turn Weight Gratio MPG Rep78 Rep77 Price Hroom Trunk Rseat Length Weight Displa Turn Turn Displa Weight Length Rseat Trunk Hroom Price Rep77 Rep78 MPG Gratio Weight Turn Trunk Rseat Rep78 Rep77 Price MPG Length Hroom Gratio Displa Loglinear Models for Categorical Data overview2 Principles of Graphical Displays Effect ordering (Friendly and Kwan, 2003) In tables and graphs, sort unordered factors according to the effects you want to see/show. Auto data: Alpha order Corrgrams: Exploratory displays for correlation matrices (Friendly, 2002) SPIDA Michael Friendly

7 overview2 Comparisons Make visual comparisons easy Visual grouping connect with lines, make key comparisons contiguous Baselines compare data to model against a line, preferably horizontal Frequency Sqrt(frequency) Number of Occurrences Number of Occurrences SPIDA Michael Friendly

8 overview2 Small multiples combine stratified graphs into coherent displays (Tufte, 1983) e.g., scatterplot matrix for quantitative data: all pairwise scatterplots 87.2 Prestige Educ Income Women 0 SPIDA Michael Friendly

9 overview2 e.g., mosaic matrix for quantitative data: all pairwise mosaic plots Admit Male Female A B C D E F Admit Reject A B C D E F Male Female Gender Admit Reject Male Female A B A B C C Dept SPIDA Michael Friendly D D E E F F Male Female Admit Reject Admit Reject

10 prelim Loglinear models: Preliminaries Categorical data: Variables with a finite number of discrete values Binary: pass/fail, live/die, vote yes/vote no Nominal: unordered categories Religion: Christian, Jewish, Muslim,... Party preference: Conservative, Liberal, NDP, Bloc Québécois, Green Party Ordinal: ordered categories Education: Primary, Secondary, University, Post-graduate Income ranges: 0-20, 20-30, 30-40,... Likert scale: strongly disagree strongly agree Interval: Counts: number of children 0, 1, 2,... SPIDA Michael Friendly

11 prelim Loglinear models: Preliminaries Data representations: Case form: One row per case (observation), one column per variable standard data table (Quantitative data is always represented in case form) Table form: A set of n variables represented as an n-way table Table 1: Admissions to Berkeley graduate programs Admitted Rejected Total Males Females Total SPIDA Michael Friendly

12 prelim 3+-way tables Table 2: Berkeley data: 3-way table Male Female Dept Admit Reject Admit Reject A B C D E F SPIDA Michael Friendly

13 prelim Frequency form: One row per combination of variable categories 1 Obs Admit Gender Dept count Admitted Male A Admitted Male B Admitted Male C Admitted Male D Admitted Male E Admitted Male F Admitted Female A Admitted Female B Admitted Female C Admitted Female D Admitted Female E Admitted Female F Rejected Female E Rejected Female F 317 SPIDA Michael Friendly

14 prelim Loglinear models: Preliminaries Data transformations: All SAS procedures take input in case form. Most SAS procedures take input in frequency form need to use a FREQ or WEIGHT statement to identify the frequency (count) variable. Categorical variables can be represented by numeric or character values Character variables are ordered alphabetically by default often not what you want (e.g., Low, Medium, High High, Low, Medium) Often easiest to use numeric values, and create a SAS format to attach value labels: 1 proc format; 2 value inc 1= Low 2= Medium 3= High ; 3 proc freq; 4 tables educ * income; 5 format income inc.; SPIDA Michael Friendly

15 prelim Recoding categories Categories can be collapsed (frequencies summed) using SAS formats 1 proc format; 2 value edu 1-8= Primary 9-12= Secondary = University 16-high= Post-grad ; 4 proc freq; 5 tables educ * income; 6 format educ edu. income inc.; Reordering categories Many SAS procedures allow several different ways to order the categories of a discrete variable ORDER=INTERNAL by numeric or character value (default). ORDER=DATA by order of appearance in the dataset. ORDER=FORMATTED by order of formatted value. ORDER=FREQ by order of frequency of occurrence. The table macro Collapse or recode variables SPIDA Michael Friendly

16 overview Related perspectives Loglinear models: Overview Loglinear models can be developed as an analog of classical ANOVA and regression models, where multiplicative relations (under independence) are re-expressed in additive form as models for log(frequency). Alternatively, but somewhat more generally, loglinear models are also GLMs for log(frequency), with a Poisson distribution for the cell counts. Two-way tables: Loglinear approach For two discrete variables, A and B, suppose a multinomial sample of total size n over the IJ cells of a two-way I J contingency table, with cell frequencies n ij, and cell probabilities π ij = n ij /n. The table variables are statistically independent when the cell (joint) probability equals the product of the marginal probabilities, Pr(A = i & B = j) = Pr(A = i) Pr(B = j), or, π ij = π i+ π +j. An equivalent model in terms of expected frequencies, m ij = nπ ij is m ij = n m i+ m +j. SPIDA Michael Friendly

17 overview This multiplicative model can be expressed in additive form as a model for log m ij, log m ij = log n + log m i+ + log m +j. (1) By anology with ANOVA models, the independence model (1) can be expressed as log m ij = µ + λ A i + λ B j, (2) where µ is the grand mean of log m ij and the parameters λ A i and λ B j express the marginal frequencies of variables A and B, and are typically defined so that i λa i = j λb j = 0. Dependence between the table variables is expressed by adding association parameters, λ AB ij, giving the saturated model, log m ij = µ + λ A i + λ B j + λ AB ij. (3) The saturated model fits the table perfectly ( m ij = n ij ): there are as many parameters as cell frequencies. A global test for association tests H 0 : λ AB ij = 0. For ordinal variables, the λ AB may be structured more simply, giving tests for ordinal association. ij SPIDA Michael Friendly

18 overview Two-way tables: GLM approach In the GLM approach, the vector of cell frequencies, n = {n ij } is specified to have a Poisson distribution with means m = {m ij } given by log m = Xβ where X is a known design (model) matrix and β is a column vector containing the unknown λ parameters. For example, for a 2 2 table, the saturated model (3) with the usual zero-sum constraints can be represented as log m 11 log m 12 log m 21 log m 22 = µ λ A 1 λ B 1 λ AB 11 Note that only the linearly independent parameters are represented. λ A 2 = λ A 1, because λa 1 + λ A 2 = 0, and so forth. An advantage of the GLM formulation is that it makes it somewhat easier to express models with ordinal or quantitative variables. SPIDA Michael Friendly

19 overview Three-way Tables Saturated model: For a 3-way table, of size I J K for variables A, B, C, the saturated loglinear model includes associations between all pairs of variables, as well as a 3-way association term, λ ABC ijk log m ijk = µ + λ A i + λ B j + λ C k + λ AB ij + λ AC ik + λ BC jk + λ ABC ijk. (4) As before, one-way terms (λ A i, λb j, λc k ij, λ AC ik, λbc jk ) pertain to differences in the marginal frequencies of the table variables. Two-way terms (λ AB ) pertain to the partial association for each pair of variables, controlling for the remaining variable. The three-way term, λ ABC allows the partial association between any pair of ijk variables to vary over the categories of the third variable. Such models are usually hierarchical: the presence of a high-order term, such as λ ABC implies that all low-order relatives are automatically included. ijk Thus, a short-hand notation for a loglinear model lists only the high-order terms, i.e., model (4) [ABC] SPIDA Michael Friendly

20 overview Reduced models: The usual goal is to fit the smallest model (fewest high-order terms) that is sufficient to explain/describe the observed frequencies. All representative subset models are shown in the table below Table 3: Log-linear Models for Three-Way Tables Model Model symbol Interpretation Mutual independence [A][B][C] A B C Joint independence [AB][C] (A B) C Conditional independence [AC][BC] (A B) C All two-way associations [AB][AC][BC] homogeneous association Saturated model [ABC] interaction e.g., [AB][C] log m ijk = µ + λ A i + λ B j + λ C k + λ AB ij SPIDA Michael Friendly

21 overview Assessing goodness of fit Goodness of fit of a specified model may be tested by the likelihood ratio G 2, G 2 = 2 i n i log(n i / m i ), (5) or the Pearson χ 2, χ 2 = i (n i m i ) 2 m i, (6) with degrees of freedom = # cells - # estimated parameters. E.g., for the model of mutual independence, [A][B][C], df = IJK (I 1) (J 1) (K 1) = (I 1)(J 1)(K 1) The terms summed in (5) and (6) are the squared cell residuals Other measures of balance goodness of fit against parsimony, e.g., Akaike s Information Criterion (smaller is better) AIC = G 2 2df or AIC = G # parameters SPIDA Michael Friendly

22 vismethods Visualizing Contingency tables Two-way tables 2 2 tables Visualize odds ratio (FFOLD macro) 2 2 k tables Homogeneity of association r 3 tables Trilinear plots (TRIPLOT macro) r c tables Visualize association (SIEVE program) r c tables Visualize association (MOSAIC macro) Square r r tables Visualize agreement (AGREE program) n-way tables Fit loglinear models, visualize lack-of-fit (MOSAIC macro) Test & visualize partial association (MOSAIC macro) Visualize pairwise association (MOSMAT macro) Visualize conditional association (MOSMAT macro) Visualize loglinear structure (MOSMAT macro) Correspondence analysis and MCA (CORRESP macro) SPIDA Michael Friendly

23 2by2 2 2 tables 2 2 tables provide the simplest context to illustrate loglinear models and their relations to other tests and modeling techniques. Example: Bickel et al. (1975): admissions to graduate depatments at U. C. Berkeley in 1973 Aggregate data for the six largest departments: Table 4: Admissions to Berkeley graduate programs Admitted Rejected Total % Admitted Males Females Total Equivalent questions: Proportions: Are the proportions of males and females admitted the same? Method: Test of independent proportions, H 0 : p 1 = p 2 Answer: z = ˆp 1 ˆp 2 = p(1 p)(1/n1 +1/n 2 ) = (.6163)(1/2691+1/1835) SPIDA Michael Friendly

24 2by2 2 2 tables Odds: Are the odds of admission the same for males and females? Odds(Admit Male) Method: Odds ratio, H 0 : θ = Odds(Admit = 1 Female) Odds(Admit Male) Answer: θ = Odds(Admit = 1198/1493 Female) 557/1276 = 1.84 Males 84% more likely to be admitted. Is admission independent of gender? Method: χ 2 test of independence, H 0 : p ij = p i p j Answer: χ 2 (1) = (n ij m ij ) 2 m ij = 92.21, p < Does cell frequency depend only on marginal frequencies of gender and admission? Method: Loglinear model, log(m ij ) = µ + λ A i + λ G j + λag ij Answer: H 0 : λ AG ij = 0 LR G 2 (1) = 93.45, p < SPIDA Michael Friendly

25 2by2 Standard analysis: PROC FREQ proc freq data=berkeley; weight freq; tables gender*admit / chisq; Output: Statistics for Table of gender by admit Statistic DF Value Prob Chi-Square <.0001 Likelihood Ratio Chi-Square <.0001 Continuity Adj. Chi-Square <.0001 Mantel-Haenszel Chi-Square <.0001 Phi Coefficient How to visualize and interpret? SPIDA Michael Friendly

26 2by2 Fourfold displays for 2 2 tables Quarter circles: radius n ij area frequency Independence: Adjoining quadrants align Odds ratio: ratio of areas of diagonally opposite cells Confidence rings: Visual test of H 0 : θ = 1 adjoining rings overlap Sex: Male Admit?: Yes Admit?: No Sex: Female Confidence rings do not overlap: θ 1 SPIDA Michael Friendly

27 2by2 Fourfold displays for 2 2 k tables Data in Table 4 had been pooled over departments Stratified analysis: one fourfold display for each department Each 2 2 table standardized to equate marginal frequencies Shading: highlight departments for which H a : θ i 1 Department: A Sex: Male Department: B Sex: Male Department: C Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Department: D Sex: Male Sex: Female Department: E Sex: Male Sex: Female Department: F Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Sex: Female Sex: Female Only one department (A) shows association; θ A = women (0.349) 1 = 2.86 times as likely as men to be admitted. SPIDA Michael Friendly

28 2by2 What happened here? Simpson s paradox: Aggregate data are misleading because they falsely assume men and women apply equally in each field. But: Large differences in admission rates across departments. Men and women apply to these departments differentially. Women applied in large numbers to departments with low admission rates. (This ignores possibility of structural bias against women: differential funding of fields to which women are more likely to apply.) Other graphical methods can show these effects. SPIDA Michael Friendly

29 2by2 The FOURFOLD program and the FFOLD macro The FOURFOLD program is written in SAS/IML. The FFOLD macro provides a simpler interface. Printed output: (a) significance tests for individual odds ratios, (b) tests of homogeneity of association (here, over departments) and (c) conditional association (controlling for department). Plot by department: 1 %include catdata(berkeley); 2 berk4f.sas 3 %ffold(data=berkeley, 4 var=admit Gender, /* panel variables */ 5 by=dept, /* stratify by dept */ 6 down=2, across=3, /* panel arrangement */ 7 htext=2); /* font size */ Aggregate data: first sum over departments, using the TABLE macro: 8 %table(data=berkeley, out=berk2, 9 var=admit Gender, /* omit dept */ 10 weight=count, /* frequency variable */ 11 order=data); 12 %ffold(data=berk2, var=admit Gender); SPIDA Michael Friendly

30 2by2 The FOURFOLD program and the FFOLD macro Printed output: Odds (Admitted Male) / (Admitted Female) Odds Ratio Log Odds SE(LogOdds) Z Pr> Z A B C D E F SPIDA Michael Friendly

31 2by2 For a 2 2 k table, additional tests are also of interest: Homogeneity of odds ratios: H 0 : θ 1 = θ 2 = θ k (This is equivalent to the loglinear model of no 3-way association, [AG][AD][DG].) Conditional independence: Admit Gender Dept Test of Homogeneity of Odds Ratios (no 3-Way Association) TEST CHISQ DF PROB Homogeneity of Odds Ratios Conditional Independence of Admit and Gender Dept (assuming Homogeneity) TEST CHISQ DF PROB Likelihood-Ratio Cochran-Mantel-Haenszel SPIDA Michael Friendly

32 twoway Two-way frequency tables Table 5: Hair-color eye-color data Eye Hair Color Color Black Brown Red Blond Total Green Hazel Blue Brown Total SPIDA Michael Friendly

33 twoway Two-way frequency tables: PROC FREQ proc freq data=haireye; weight count; tables Eye * HAIR / chisq; Output: Statistics for Table of Eye by HAIR Statistic DF Value Prob Chi-Square <.0001 Likelihood Ratio Chi-Square <.0001 Mantel-Haenszel Chi-Square <.0001 Phi Coefficient SPIDA Michael Friendly

34 twoway Two-way frequency tables: PROC GENMOD We can fit the same model as a Poisson GLM for count: proc genmod data=haireye; class hair eye; model count = hair eye hair*eye / dist=poisson type3; Output: LR Statistics For Type 3 Analysis Chi- Source DF Square Pr > ChiSq hair <.0001 eye <.0001 hair*eye <.0001 How to visualize and interpret? SPIDA Michael Friendly

35 twoway count area Two-way frequency tables: Sieve diagrams When row/col variables are independent, n ij n i+ n +j each cell can be represented as a rectangle, with area = height width frequency, n ij Green Expected frequencies: Hair Eye Color Data Hazel Eye Color Blue Brown Black 286 Brown Hair Color 71 Red 127 Blond 592 SPIDA Michael Friendly

36 twoway Sieve diagrams Height/width marginal frequencies, n i+, n +j Area expected frequency, n i+ n +j Shading observed frequency, n ij, color: sign(n ij ˆm ij ). Independence: Shown when density of shading is uniform. Green Hazel Eye Color Blue Brown Black Brown Red Blond Hair Color SPIDA Michael Friendly

37 twoway Sieve diagrams Effect ordering: Reorder rows/cols to make the pattern coherent Blue Eye Color Green Hazel Brown Black Brown Red Blond Hair Color SPIDA Michael Friendly

38 twoway Sieve diagrams Vision classification data for 7477 women Unaided distant vision data High Right Eye Grade 2 3 Low High 2 3 Left Eye Grade Low SPIDA Michael Friendly

39 nway2 Mosaic displays and Log-linear Models Hartigan and Kleiner (1981), Friendly (1994, 1999): Width one set of marginals, n i+ Height relative proportions of other variable, p j i = n ij /n i+ area frequency, n ij = n i+ p j i Model: (Gender)(Admit) Admitted Rejected Standardized residuals: <-4-4:-2-2:-0 0:2 2:4 >4 Male Female SPIDA Michael Friendly

40 nway2 Shading: Sign and magnitude of Pearson χ 2 residual, d ij = (n ij ˆm ij )/ ˆm ij (or L.R. G 2 ) Sign: negative in red; + positive in blue Magnitude: intensity of shading: d ij > 0, 2, 4,... Independence: Rows align, or cells are empty! E.g., aggregate Berkeley data, independence model: Model: (Gender)(Admit) Admitted Rejected Standardized residuals: <-4-4:-2-2:-0 0:2 2:4 >4 Male Female SPIDA Michael Friendly

41 nway2 Mosaic displays Departments Gender: Did departments differ in the total number of applicants? Did men and women apply differentially to departments? Model: (Dept)(Gender) A B C D E F Model [Dept] [Gender]: G 2 (5) = Note: Departments ordered A F by overall rate of admission. Male Female SPIDA Michael Friendly

42 nway2 Mosaic displays: Hair color and eye color Brown Hazel Green Blue Dark hair goes with dark eyes, light hair with light eyes Red hair, hazel eyes an exception? Effect ordering: Rows/cols permuted by CA Dimension Black Brown Red Blond SPIDA Michael Friendly

43 nway2 Mosaic displays for multiway tables Generalizes to n-way tables: divide cells recursively Can fit any log-linear model (e.g., 3-way), Table 6: Log-linear Models for Three-Way Tables Model Model symbol Independence interpretation Mutual independence [A][B][C] A B C Joint independence [AB][C] (A B) C Conditional independence [AC][BC] (A B) C All two-way associations [AB][AC][BC] (none) Saturated model [ABC] (none) e.g., the model for conditional independence (A C B): [AB][BC] log m ijk = µ + λ A i + λ B j + λ C k + λ AB ij + λ BC jk Each mosaics shows: DATA (size of tiles) (some) marginal frequencies (spacing visual grouping) RESIDUALS (shading) what associations have been omitted? SPIDA Michael Friendly

44 nway2 E.g., Joint independence, [DG][A] (null model, Admit as response) [G 2 (11) = 877.1]: Model: (DeptGender)(Admit) A B C D E F Admitted Male Rejected Female SPIDA Michael Friendly

45 nway2 Mosaic displays for multiway tables Visual fitting: Pattern of lack-of-fit (residuals) better model smaller residuals cleaning the mosaic better model empty cells best done interactively! Model: (DeptGender)(DeptAdmit) A B C D E F E.g., Add [Dept Admit] association Conditional independence: Fits poorly, overall (G 2 (6) = 21.74) But, only in Department A! Admitted Male Rejected Female SPIDA Michael Friendly

46 nway2 Sequential plots and models Mosaic for an n-way table hierarchical decomposition of association in a way analogous to sequential fitting in regression Joint cell probabilities are decomposed as p ijkl = First 2 terms mosaic for v 1 and v 2 {v 1 v 2 } {}}{ p i p j i p k ij p l ijk p n ijk }{{} {v 1 v 2 v 3 } First 3 terms mosaic for v 1, v 2 and v 3 Sequential models of joint independence additive decomposition of the total association, G 2 [v 1 ][v 2 ]...[v p ] (mutual independence), G 2 [v 1 ][v 2 ]...[v p ] = G2 [v 1 ][v 2 ] +G2 [v 1 v 2 ][v 3 ] +G2 [v 1 v 2 v 3 ][v 4 ] + +G2 [v 1...v p 1 ][v p ] SPIDA Michael Friendly

47 nway2 Sequential plots and models: Example Hair color x Eye color marginal table (ignoring Sex) (Hair)(Eye), G2 (9) = Brown Hazel Green Blue Black Brown Red Blond SPIDA Michael Friendly

48 nway2 Sequential plots and models: Example 3-way table, Joint Independence Model [Hair Eye] [Sex] (HairEye)(Sex), G2 (15) = Brown Hazel Green Blue M F Black Brown Red Blond SPIDA Michael Friendly

49 nway2 Sequential plots and models: Example 3-way table, Mutual Independence Model [Hair] [Eye] [Sex] (Hair)(Eye)(Sex), G2 (24) = Brown Hazel Green Blue M F Black Brown Red Blond SPIDA Michael Friendly

50 nway2 Sequential plots and models: Example (Hair)(Eye), G2 (9) = (HairEye)(Sex), G2 (15) = (Hair)(Eye)(Sex), G2 (24) = Brown Hazel Green Blue + Brown Hazel Green Blue = Brown Hazel Green Blue Black Brown Red Blond M F Black Brown Red Blond M F Black Brown Red Blond [Hair] [Eye] G 2 (9) = [Hair Eye] [Sex] G 2 (15) = [Hair] [Eye] [Sex] G 2 (24) = SPIDA Michael Friendly

51 nway2 Mosaic matrices Analog of scatterplot matrix for categorical data (Friendly, 1999) Shows all p(p 1) pairwise views in a coherent display Each pairwise mosaic shows bivariate (marginal) relation Fit: marginal independence Residuals: show marginal associations Direct visualization of the Burt matrix analyzed in multiple correspondence analysis for p categorical variables Hair Black Brown Red Blond Black Brown Red Blond Brown Haz Grn Blue Male Female Blue Grn Brown Haz Eye Grn Blue Brown Haz Black Brown Red Blond Male Female Male Female Male Female Sex Black Brown Red Blond Brown Haz Grn Blue SPIDA Michael Friendly

52 nway2 Hair Brown Haz Grn Blue Male Female Black Brown Red Blond Brown Haz Grn Blue Eye Male Female Black Brown Red Blond Brown Haz Grn Blue Male Female Male Female Sex SPIDA Michael Friendly Brown Haz Grn Blue Black Brown Red Blond Black Brown Red Blond

53 nway2 Berkeley data: Admit Male Female A B C D E F Admit Reject Male Female Gender A B C D E F Admit Reject Male Female A B C D A B C D E Dept SPIDA Michael Friendly E F F Male Female Admit Reject Admit Reject

54 nway2 Partial association, Partial mosaics Stratified analysis: How does the association between two (or more) variables vary over levels of other variables? Mosaic plots for the main variables show partial association at each level of the other variables. E.g., Hair color, Eye color BY Sex TABLES sex * hair * eye; Sex: Male Sex: Female Brown Hazel Green Blue Brown Hazel Green Blue Black Brown Red -3.3 Blond Black Brown Red Blond SPIDA Michael Friendly

55 nway2 Partial association, Partial mosaics Stratified analysis: For models of partial independence, A B at each level of (controlling for) C A B C k, partial G 2 s add to the overall G 2 for conditional independence, G 2 A B C = k G 2 A B C(k) Table 7: Partial and Overall conditional tests, Hair Eye Sex Model df G 2 p-value [Hair][Eye] Male [Hair][Eye] Female [Hair][Eye] Sex SPIDA Michael Friendly

56 nway2 Software for Mosaic Displays Demonstration web applet: Runs the current version of mosaics via a cgi-script Can run sample data, upload a data file, enter data in a form. Choose model fitting and display options (not all supported). SPIDA Michael Friendly

57 mossoft SPIDA Michael Friendly

58 mossoft SPIDA Michael Friendly

59 mossoft SPIDA Michael Friendly

60 mossoft Software for Mosaic Displays SAS software & documentation: Examples: Many in VCD and on web site SAS/IML modules: mosaics.sas SAS/IML program Enter frequency table directly in SAS/IML, or read from a SAS dataset. Most flexible: Select, collapse, reorder, re-label table levels using SAS/IML statements Specify structural 0s, fit specialized models (e.g., quasi-independence) Interface to models fit using PROC GENMOD SPIDA Michael Friendly

61 mossoft Software for Mosaic Displays Macro interface: mosaic macro, table macro, mosmat macro mosaic macro Easiest to use: Direct input from a SAS dataset No knowledge of SAS/IML required Reorder table variables; collapse, reorder table levels with table macro Convenient interface to partial mosaics (BY=) table macro Create frequency table from raw data Collapse, reorder table categories Re-code table categories using SAS formats, e.g., 1= Male 2= Female mosmat macro Mosaic matrices analog of scatterplot matrix (Friendly, 1999) SPIDA Michael Friendly

62 mossoft mosaic macro example: Berkeley data berkeley.sas 1 title Berkeley Admissions data ; 2 proc format; 3 value admit 1="Admitted" 0="Rejected" ; 4 value dept 1="A" 2="B" 3="C" 4="D" 5="E" 6="F"; 5 value $sex M = Male F = Female ; 6 data berkeley; 7 do dept = 1 to 6; 8 do gender = M, F ; 9 do admit = 1, 0; 10 input 11 output; 12 end; end; end; 13 /* -- Male -- - Female- */ 14 /* Admit Rej Admit Rej */ 15 datalines; /* Dept A */ /* B */ /* C */ /* D */ /* E */ /* F */ 22 ; SPIDA Michael Friendly

63 mossoft Data set berkeley: dept gender admit freq 1 M M F F M M F F M M F F M M F F M M F F M M F F SPIDA Michael Friendly

64 mossoft mosaic macro example: Berkeley data mosaic9m.sas 1 goptions hsize=7in vsize=7in; 2 3 %include catdata(berkeley); 4 5 *-- apply character formats to numeric table variables; 6 %table(data=berkeley, 7 var=admit Gender Dept, 8 weight=freq, 9 char=y, format=admit admit. gender $sex. dept dept., 10 order=data, out=berkeley); %mosaic(data=berkeley, 13 vorder=dept Gender Admit, /* reorder variables */ 14 plots=2:3, /* which plots? */ 15 fittype=joint, /* fit joint indep. */ 16 split=h V V, htext=3); /* options */ SPIDA Michael Friendly

65 mossoft mosaic macro example: Berkeley data Model: (Dept)(Gender) Model: (DeptGender)(Admit) A B C D E F A B C D E F Male Two-way, Dept. by Gender Female Admitted Rejected Male Female Three-way, Dept. by Gender by Admit SPIDA Michael Friendly

66 mossoft mosmat macro: Mosaic matrices mosmat9m.sas 1 %include catdata(berkeley); 2 %mosmat(data=berkeley, 3 vorder=admit Gender Dept, sort=no); Admit Admit Reject Admit Reject Male Female A B C D E F Male Female Gender Male Female Admit Reject A B C D E F F A B C D E A B C D E F Dept Admit Reject Male Female SPIDA Michael Friendly

67 mossoft Partial mosaics mospart3.sas 1 %include catdata(hairdat3s); 2 %gdispla(off); 3 %mosaic(data=haireye, 4 vorder=hair Eye Sex, by=sex, 5 htext=2, cellfill=dev); 6 %gdispla(on); 7 %panels(rows=1, cols=2); /* make 2 figs -> 1 */ Sex: Male Sex: Female Brown Hazel Green Blue Brown Hazel Green Blue Black Brown Red -3.3 Blond Black Brown Red Blond SPIDA Michael Friendly

68 corresp Correspondence analysis and MCA Correspondence analysis (CA): Analog of PCA for frequency data: account for maximum % of χ 2 in few (2-3) dimensions finds scores for row (x im ) and column (y jm ) categories on these dimensions uses Singular Value Decomposition of residuals from independence, d ij = (n ij m ij )/ m ij d ij n = M m=1 λ m x im y jm optimal scaling: each pair of scores, x im ) and column (y jm, have highest possible correlation (= λ m ). plots of the row (x im ) and column (y jm ) scores show associations MCA: Extends CA to n-way tables, but only uses bivariate associations (like mosaic matrix) SPIDA Michael Friendly

69 corresp Hair color, Eye color data: 0.5 * Eye color * HAIR COLOR RED Green Dim 2 (9.5%) 0.0 Brown BLACK Hazel BROWN Blue BLOND Dim 1 (89.4%) Interpretation: row/column points near each other are positively associated Dim 1: 89.4% of χ 2 (dark light) Dim 2: 9.5% of χ 2 (RED/Green vs. others) SPIDA Michael Friendly

70 corresp PROC CORRESP and the CORRESP macro Two forms of input dataset: dataset in contingency table form column variables are levels of one factor, observations (rows) are levels of the other. Obs Eye BLACK BROWN RED BLOND 1 Brown Blue Hazel Green Raw category responses (case form), or cell frequencies (frequency form), classified by 2 or more factors (e.g., output from PROC FREQ) Obs Eye HAIR Count 1 Brown BLACK 68 2 Brown BROWN Brown RED 26 4 Brown BLOND Green RED Green BLOND 16 SPIDA Michael Friendly

71 corresp PROC CORRESP PROC CORRESP and the CORRESP macro Handles 2-way CA, extensions to n-way tables, and MCA Many options for scaling row/column coordinates and output statistics OUTC= option output dataset for plotting (PROC CORRESP doesn t do plots itself) CORRESP macro Uses PROC CORRESP for analysis Produces labeled plots of the category points in either 2 or 3 dimensions Many graphic options; can equate axes automatically See: SPIDA Michael Friendly

72 corresp Example: Hair and Eye Color Input the data in contingency table form 1 data haireye; corresp2a.sas 2 input EYE $ BLACK BROWN RED BLOND ; 3 datalines; 4 Brown Blue Hazel Green ; SPIDA Michael Friendly

73 corresp Example: Hair and Eye Color Using PROC CORRESP directly labeled printer plot proc corresp data=haireye outc=coord short; id eye; /* row variable */ var black brown red blond; /* col variables */ proc plot data=coord vtoh=2; /* plot step */ plot dim2 * dim1 = * $eye / box haxis=by.1 vaxis=by.1; /* plot options */ Using the CORRESP macro labeled high-res plot %corresp (data=haireye, id=eye, /* row variable */ var=black brown red blond, /* col variables */ dimlab=dim); /* options */ SPIDA Michael Friendly

74 corresp Example: Hair and Eye Color Printed output: The Correspondence Analysis Procedure Inertia and Chi-Square Decomposition Singular Principal Chi- Values Inertias Squares Percents % ************************* % *** % (Degrees of Freedom = 9) Row Coordinates Dim1 Dim2 Brown Blue Hazel Green Column Coordinates Dim1 Dim2 BLACK BROWN RED BLOND SPIDA Michael Friendly

75 corresp Example: Hair and Eye Color Output dataset(selected variables): Obs _TYPE_ EYE DIM1 DIM2 1 INERTIA.. 2 OBS Brown OBS Blue OBS Hazel OBS Green VAR BLACK VAR BROWN VAR RED VAR BLOND Row and column points are distinguished by the _TYPE_ variable: OBS vs. VAR SPIDA Michael Friendly

76 corresp Example: Hair and Eye Color Graphic output from CORRESP macro: * Eye color * HAIR COLOR 0.5 RED Green Dim 2 (9.5%) 0.0 Brown BLACK Hazel BROWN Blue BLOND Dim 1 (89.4%) Top legend produced with Annotate data set and the INANNO= option to the CORRESP macro SPIDA Michael Friendly

77 corresp Multi-way tables Stacking approach: van der Heijden and de Leeuw (1985) three-way table, of size I J K can be sliced and stacked as a two-way table, of size (I J) K I x J x K table J (I J )x K table I J J K J The variables combined are treated interactively Each way of stacking corresponds to a loglinear model (I J) K [AB][C] I (J K) [A][BC] J (I K) [B][AC] K SPIDA Michael Friendly

78 corresp Multi-way tables: Stacking PROC CORRESP: Use TABLES statement and option CROSS=ROW or CROSS=COL. E.g., for model [A B] [C], proc corresp cross=row; tables A B, C; weight count; CORRESP macro: Can use / instead of, %corresp( options=cross=row, tables=a B/ C, weight count); SPIDA Michael Friendly

79 corresp Example: Suicide Rates Suicide rates in West Germany, by Age, Sex and Method of suicide Sex Age POISON GAS HANG DROWN GUN JUMP M M M M M F F F F F CA of the [Age Sex] by [Method] table: Shows associations between the Age-Sex combinations and Method Ignores association between Age and Sex SPIDA Michael Friendly

80 corresp Example: Suicide Rates suicide5.sas 1 %include catdata(suicide); 2 *-- equate axes!; 3 axis1 order=(-.7 to.7 by.7) length=6.5 in label=(a=90 r=0); 4 axis2 order=(-.7 to.7 by.7) length=6.5 in; 5 %corresp(data=suicide, weight=count, 6 tables=%str(age sex, method), 7 options=cross=row short, 8 vaxis=axis1, haxis=axis2); Output: Inertia and Chi-Square Decomposition Singular Principal Chi- Values Inertias Squares Percents % ************************* % ************** % ** % % (Degrees of Freedom = 45) SPIDA Michael Friendly

81 corresp 0.7 Gas * F * F * M Gun Dimension 2 (33.0%) 0.0 Poison * M * F * M Jump * F * F Hang * M Drown * M Dimension 1 (60.4%) SPIDA Michael Friendly

82 MALE FEMALE Gun Gas Hang Poison Jump Drown Loglinear Models for Categorical Data corresp Compare with mosaic display: Suicide data - Model (SexAge)(Method) >65 SPIDA Michael Friendly Standardized residuals: <-8-8:-4-4:-2-2:-0 0:2 2:4 4:8 >8

83 structured Ordered categories More structured tables Tables with ordered categories allow more parsimonious representations of association Can represent λ AB ij by a small number of parameters more focused and more powerful tests of lack of independence Square tables For square I I tables, where row and column variables have the same categories: Can ignore diagonal cells, where association is expected and test remaining association (quasi-independence) Can test whether association is symmetric around the diagonal cells. SPIDA Michael Friendly

84 structured Ordered categories Ordinal scores In many cases it may be reasonable to assign numeric scores, {a i } to an ordinal row variable and/or numeric scores, {b i } to an ordinal column variable. Typically, scores are equally spaced and sum to zero, {a i } = i (I + 1)/2, e.g., {a i } = { 1, 0, 1} for I=3. Linear-by-Linear (Uniform) Association: When both variables are ordinal, the simplest model posits that any association is linear in both variables. λ AB ij = γ a i b j This model adds one additional parameter to the independence model (γ = 0). It is similar to For integer scores, the local log odds ratio for any contiguous 2 2 table are all equal, log θ ij = γ This is a model of uniform association SPIDA Michael Friendly

85 structured Row Effects and Column Effects: When only one variable is assigned scores, we have the row effects model or the column effects model. E.g., in the row effects model, the row variable (A) is treated as nominal, while the column variable (B) is assigned ordered scores {b j }. log m ij = µ + λ A i + λ B j + α i b j where the row parameters, α i, are defined so they sum to zero. This model has (I 1) more parameters than the independence model. A Row Effects + Column Effects model allows both variables to be ordered, but not necessarily with linear scores. Fitting models for ordinal variables Create numeric variables for category scores PROC GENMOD: Use as quantitative variables in MODEL statement, but not listed as CLASS variables SPIDA Michael Friendly

86 structured Example: Mental impairment and parents SES Srole et al. (1978) Data on mental health status of 1600 young NYC residents in relation to parents SES. Mental health: Well, mild symptoms, moderate symptoms, Impaired SES: 1 (High) 6 (Low) Mental Parents SES health High Low 1: Well : Mild : Moderate : Impaired SPIDA Michael Friendly

87 structured Mental Impairment and SES: Independence Linear x Linear Mental Well Mild Moderate Impaired Mental Well Mild Moderate Impaired High Low SES High Low SES Visual assessment: Residuals from the independence model show an opposite-corner pattern. This is consistent with both: Linear linear model: equi-spaced scores for both Mental and SES Row effects model: equi-spaced scores for SES, ordered scores for Mental SPIDA Michael Friendly

88 structured Statistical assesment: Table 8: Mental health data: Goodness-of-fit statistics for ordinal loglinear models Model G 2 df Pr(> G 2 ) AIC AIC-best Independence Col effects (SES) Row effects (mental) Lin x Lin Both the Row Effects and Linear linear models are significantly better than the Independence model AIC indicates a slight preference for the Linear linear model In the Linear linear model, the estimate of the coefficient of a i b j is ˆγ = = log θ, so ˆθ = exp(0.0907) = each step down the SES scale increases the odds of being classified one step poorer in mental health by 9.5%. SPIDA Michael Friendly

89 structured Fitting these models with PROC GENMOD: mentgen2.sas 1 %include catdata(mental); 2 data mental; 3 set mental; 4 m_lin = mental; *-- copy m_lin and s_lin for; 5 s_lin = ses; *-- use non-class variables; 6 7 title Independence model ; 8 proc genmod data=mental; 9 class mental ses; 10 model count = mental ses / dist=poisson obstats residuals; 11 format mental mental. ses ses.; 12 ods output obstats=obstats; 13 %mosaic(data=obstats, vorder=mental SES, resid=stresdev, 14 title=mental Impairment and SES: Independence, split=h V); Row Effects model: mentgen2.sas 16 proc genmod data=mental; 17 class mental ses; 18 model count = mental ses mental*s_lin / dist=poisson obstats; Linear linear model: mentgen2.sas 21 proc genmod data=mental; 22 class mental ses; 23 model count = mental ses m_lin*s_lin / dist=poisson obstats; SPIDA Michael Friendly

90 square Square-ish tables Tables where two (or more) variables have the same category levels: Employment categories of related persons (mobility tables) Multiple measurements over time (panel studies; longitudinal data) Repeated measures on the same individuals under different conditions Related/repeated measures are rarely independent, but may have simpler forms than general association E.g., vision data: Left and right eye acuity grade for 7477 women Independence, G2(9)= Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly

91 square Quasi-Independence Related/repeated measures are rarely independent most observations often fall on diagonal cells. Quasi-independence ignores diagonals, tests independence in remaining cells (λ ij = 0 for i j). The model dedicates one parameter (δ i ) to each diagonal cell, fitting them exactly, where I( ) is the indicator function. log m ij = µ + λ A i + λ B j + δ i I(i = j) This model may be fit as a GLM by including indicator variables for each diagonal cell. DIAG 4 rows 4 cols SPIDA Michael Friendly

92 square Using PROC GENMOD mosaic10g.sas 1 title Quasi-independence model (women) ; 2 proc genmod data=women; 3 class RightEye LeftEye diag; 4 model Count = LeftEye RightEye diag / 5 dist=poisson link=log obstats residuals; 6 ods output obstats=obstats; 7 %mosaic(data=obstats, vorder=righteye LeftEye,...); Mosaic: Quasi-Independence, G2(5)=199.1 Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly

93 square Symmetry Tests whether the table is symmetric around the diagonal, i.e., m ij = m ji As a loglinear model, symmetry is log m ij = µ + λ A i + λ B j + λ AB ij, subject to the conditions λ A i = λ B j and λ AB ij = λ AB ji. This model may be fit as a GLM by including indicator variables with equal values for symmetric cells, and indicators for the diagonal cells (fit exactly) SYMMETRY 4 rows 4 cols) SPIDA Michael Friendly

94 square Using PROC GENMOD mosaic10g.sas 1 proc genmod data=women; 2 class symmetry; 3 model Count = symmetry / 4 dist=poisson link=log obstats residuals; 5 ods output obstats=obstats; 6 %mosaic(data=obstats, vorder=righteye LeftEye,...); Mosaic: Symmetry, G2(6)=19.25 Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly

95 square Quasi-Symmetry Symmetry is often too restrictive: equal marginal frequencies (λ A i = λ B i ) PROC GENMOD: Use the usual marginal effect parameters + symmetry: mosaic10g.sas 1 proc genmod data=women; 2 class LeftEye RightEye symmetry; 3 model Count = LeftEye RightEye symmetry / 4 dist=poisson link=log obstats residuals; 5 ods output obstats=obstats; Quasi-Symmetry, G2(3)=7.27 Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly

96 square Table 9: Summary of models fit to vision data Model G 2 df Pr(> G 2 ) AIC AIC - min(aic) Independence Linear*Linear Row+Column Effects Quasi-Independence Symmetry Quasi-Symmetry Ordinal Quasi-Symmetry Only the quasi-symmetry models provide an acceptable fit The ordinal quasi-symmetry model is most parsimonious SPIDA Michael Friendly

97 loglin References Bickel, P. J., Hammel, J. W., and O Connell, J. W. Sex bias in graduate admissions: Data from Berkeley. Science, 187: , Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, Friendly, M. Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89: , Friendly, M. Conceptual and visual models for categorical data. The American Statistician, 49: , Friendly, M. Extending mosaic displays: Marginal, conditional, and partial views of categorical data. Journal of Computational and Graphical Statistics, 8(3): , Friendly, M. Visualizing Categorical Data. SAS Institute, Cary, NC, Friendly, M. Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4): , Friendly, M. and Kwan, E. Effect ordering for data displays. Computational Statistics and Data Analysis, 43(4): , Hartigan, J. A. and Kleiner, B. Mosaics for contingency tables. In Eddy, W. F., editor, Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pp Springer-Verlag, New York, NY, Srole, L., Langner, T. S., Michael, S. T., Kirkpatrick, P., Opler, M. K., and Rennie, T. A. C. Mental Health in the Metropolis: The Midtown Manhattan Study. NYU Press, New York, SPIDA Michael Friendly

98 loglin Tufte, E. R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, van der Heijden, P. G. M. and de Leeuw, J. Correspondence analysis used complementary to loglinear analysis. Psychometrika, 50: , SPIDA Michael Friendly

Applied Multivariate Analysis

Applied Multivariate Analysis Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017 Choosing Statistical Method 1 Choice an appropriate method 2 Cross-tabulation More advance analysis of frequency tables