Loglinear Models for Categorical Data. Michael Friendly
|
|
- Eileen Bradley
- 5 years ago
- Views:
Transcription
1 Admit?: Yes Sex: Male Admit?: No Right Eye Grade High 2 3 Unaided distant vision data Brown Hazel Green Blue Sex: Female Low High 2 3 Left Eye Grade Low Black Brown Red Blond Michael Friendly York University, <friendly@yorku.ca> SPIDA 2004
2 outline Lecture Outline Introduction Graphical methods for Categorical Data Loglinear models: Prelimaries Overview of loglinear models Visualizing contingency tables and loglinear models 2 2 tables Larger tables n-way tables Structured tables Ordinal variables Square tables SPIDA Michael Friendly
3 outline Resources SAS macro programs, from SAS System for Statistical Graphics (Friendly, 1991): SAS macro programs, from Visualizing Categorical Data (Friendly, 2000): Workshop notes: SCS Short Course notes, Categorical Data Analysis with Graphics SPIDA Michael Friendly
4 overview2 Graphical Methods for Categorical Data If I can t picture it, I can t understand it. Albert Einstein Getting information from a table is like extracting sunlight from a cucumber. Farquhar & Farquhar, 1891 Tables vs. Graphs Tables are best suited for look-up read off exact numbers Graphs are better for showing patterns, trends, anomalies, making comparisons Visual presentation as communication: what do you want to say? SPIDA Michael Friendly
5 overview2 Visual metaphors Quantitative data: magnitude position along an axis Frequency data: count area (Friendly, 1995) Model: (DeptGender)(Admit) Admit?: Yes Sex: Male Admit?: No A B C D E F Sex: Female Admitted Male Rejected Female Fourfold display for 2 2 table Mosaic plot for 3-way table SPIDA Michael Friendly
6 Auto data: PC2/1 order Displa Gratio Hroom Length MPG Price Rep77 Rep78 Rseat Trunk Turn Weight Gratio MPG Rep78 Rep77 Price Hroom Trunk Rseat Length Weight Displa Turn Turn Displa Weight Length Rseat Trunk Hroom Price Rep77 Rep78 MPG Gratio Weight Turn Trunk Rseat Rep78 Rep77 Price MPG Length Hroom Gratio Displa Loglinear Models for Categorical Data overview2 Principles of Graphical Displays Effect ordering (Friendly and Kwan, 2003) In tables and graphs, sort unordered factors according to the effects you want to see/show. Auto data: Alpha order Corrgrams: Exploratory displays for correlation matrices (Friendly, 2002) SPIDA Michael Friendly
7 overview2 Comparisons Make visual comparisons easy Visual grouping connect with lines, make key comparisons contiguous Baselines compare data to model against a line, preferably horizontal Frequency Sqrt(frequency) Number of Occurrences Number of Occurrences SPIDA Michael Friendly
8 overview2 Small multiples combine stratified graphs into coherent displays (Tufte, 1983) e.g., scatterplot matrix for quantitative data: all pairwise scatterplots 87.2 Prestige Educ Income Women 0 SPIDA Michael Friendly
9 overview2 e.g., mosaic matrix for quantitative data: all pairwise mosaic plots Admit Male Female A B C D E F Admit Reject A B C D E F Male Female Gender Admit Reject Male Female A B A B C C Dept SPIDA Michael Friendly D D E E F F Male Female Admit Reject Admit Reject
10 prelim Loglinear models: Preliminaries Categorical data: Variables with a finite number of discrete values Binary: pass/fail, live/die, vote yes/vote no Nominal: unordered categories Religion: Christian, Jewish, Muslim,... Party preference: Conservative, Liberal, NDP, Bloc Québécois, Green Party Ordinal: ordered categories Education: Primary, Secondary, University, Post-graduate Income ranges: 0-20, 20-30, 30-40,... Likert scale: strongly disagree strongly agree Interval: Counts: number of children 0, 1, 2,... SPIDA Michael Friendly
11 prelim Loglinear models: Preliminaries Data representations: Case form: One row per case (observation), one column per variable standard data table (Quantitative data is always represented in case form) Table form: A set of n variables represented as an n-way table Table 1: Admissions to Berkeley graduate programs Admitted Rejected Total Males Females Total SPIDA Michael Friendly
12 prelim 3+-way tables Table 2: Berkeley data: 3-way table Male Female Dept Admit Reject Admit Reject A B C D E F SPIDA Michael Friendly
13 prelim Frequency form: One row per combination of variable categories 1 Obs Admit Gender Dept count Admitted Male A Admitted Male B Admitted Male C Admitted Male D Admitted Male E Admitted Male F Admitted Female A Admitted Female B Admitted Female C Admitted Female D Admitted Female E Admitted Female F Rejected Female E Rejected Female F 317 SPIDA Michael Friendly
14 prelim Loglinear models: Preliminaries Data transformations: All SAS procedures take input in case form. Most SAS procedures take input in frequency form need to use a FREQ or WEIGHT statement to identify the frequency (count) variable. Categorical variables can be represented by numeric or character values Character variables are ordered alphabetically by default often not what you want (e.g., Low, Medium, High High, Low, Medium) Often easiest to use numeric values, and create a SAS format to attach value labels: 1 proc format; 2 value inc 1= Low 2= Medium 3= High ; 3 proc freq; 4 tables educ * income; 5 format income inc.; SPIDA Michael Friendly
15 prelim Recoding categories Categories can be collapsed (frequencies summed) using SAS formats 1 proc format; 2 value edu 1-8= Primary 9-12= Secondary = University 16-high= Post-grad ; 4 proc freq; 5 tables educ * income; 6 format educ edu. income inc.; Reordering categories Many SAS procedures allow several different ways to order the categories of a discrete variable ORDER=INTERNAL by numeric or character value (default). ORDER=DATA by order of appearance in the dataset. ORDER=FORMATTED by order of formatted value. ORDER=FREQ by order of frequency of occurrence. The table macro Collapse or recode variables SPIDA Michael Friendly
16 overview Related perspectives Loglinear models: Overview Loglinear models can be developed as an analog of classical ANOVA and regression models, where multiplicative relations (under independence) are re-expressed in additive form as models for log(frequency). Alternatively, but somewhat more generally, loglinear models are also GLMs for log(frequency), with a Poisson distribution for the cell counts. Two-way tables: Loglinear approach For two discrete variables, A and B, suppose a multinomial sample of total size n over the IJ cells of a two-way I J contingency table, with cell frequencies n ij, and cell probabilities π ij = n ij /n. The table variables are statistically independent when the cell (joint) probability equals the product of the marginal probabilities, Pr(A = i & B = j) = Pr(A = i) Pr(B = j), or, π ij = π i+ π +j. An equivalent model in terms of expected frequencies, m ij = nπ ij is m ij = n m i+ m +j. SPIDA Michael Friendly
17 overview This multiplicative model can be expressed in additive form as a model for log m ij, log m ij = log n + log m i+ + log m +j. (1) By anology with ANOVA models, the independence model (1) can be expressed as log m ij = µ + λ A i + λ B j, (2) where µ is the grand mean of log m ij and the parameters λ A i and λ B j express the marginal frequencies of variables A and B, and are typically defined so that i λa i = j λb j = 0. Dependence between the table variables is expressed by adding association parameters, λ AB ij, giving the saturated model, log m ij = µ + λ A i + λ B j + λ AB ij. (3) The saturated model fits the table perfectly ( m ij = n ij ): there are as many parameters as cell frequencies. A global test for association tests H 0 : λ AB ij = 0. For ordinal variables, the λ AB may be structured more simply, giving tests for ordinal association. ij SPIDA Michael Friendly
18 overview Two-way tables: GLM approach In the GLM approach, the vector of cell frequencies, n = {n ij } is specified to have a Poisson distribution with means m = {m ij } given by log m = Xβ where X is a known design (model) matrix and β is a column vector containing the unknown λ parameters. For example, for a 2 2 table, the saturated model (3) with the usual zero-sum constraints can be represented as log m 11 log m 12 log m 21 log m 22 = µ λ A 1 λ B 1 λ AB 11 Note that only the linearly independent parameters are represented. λ A 2 = λ A 1, because λa 1 + λ A 2 = 0, and so forth. An advantage of the GLM formulation is that it makes it somewhat easier to express models with ordinal or quantitative variables. SPIDA Michael Friendly
19 overview Three-way Tables Saturated model: For a 3-way table, of size I J K for variables A, B, C, the saturated loglinear model includes associations between all pairs of variables, as well as a 3-way association term, λ ABC ijk log m ijk = µ + λ A i + λ B j + λ C k + λ AB ij + λ AC ik + λ BC jk + λ ABC ijk. (4) As before, one-way terms (λ A i, λb j, λc k ij, λ AC ik, λbc jk ) pertain to differences in the marginal frequencies of the table variables. Two-way terms (λ AB ) pertain to the partial association for each pair of variables, controlling for the remaining variable. The three-way term, λ ABC allows the partial association between any pair of ijk variables to vary over the categories of the third variable. Such models are usually hierarchical: the presence of a high-order term, such as λ ABC implies that all low-order relatives are automatically included. ijk Thus, a short-hand notation for a loglinear model lists only the high-order terms, i.e., model (4) [ABC] SPIDA Michael Friendly
20 overview Reduced models: The usual goal is to fit the smallest model (fewest high-order terms) that is sufficient to explain/describe the observed frequencies. All representative subset models are shown in the table below Table 3: Log-linear Models for Three-Way Tables Model Model symbol Interpretation Mutual independence [A][B][C] A B C Joint independence [AB][C] (A B) C Conditional independence [AC][BC] (A B) C All two-way associations [AB][AC][BC] homogeneous association Saturated model [ABC] interaction e.g., [AB][C] log m ijk = µ + λ A i + λ B j + λ C k + λ AB ij SPIDA Michael Friendly
21 overview Assessing goodness of fit Goodness of fit of a specified model may be tested by the likelihood ratio G 2, G 2 = 2 i n i log(n i / m i ), (5) or the Pearson χ 2, χ 2 = i (n i m i ) 2 m i, (6) with degrees of freedom = # cells - # estimated parameters. E.g., for the model of mutual independence, [A][B][C], df = IJK (I 1) (J 1) (K 1) = (I 1)(J 1)(K 1) The terms summed in (5) and (6) are the squared cell residuals Other measures of balance goodness of fit against parsimony, e.g., Akaike s Information Criterion (smaller is better) AIC = G 2 2df or AIC = G # parameters SPIDA Michael Friendly
22 vismethods Visualizing Contingency tables Two-way tables 2 2 tables Visualize odds ratio (FFOLD macro) 2 2 k tables Homogeneity of association r 3 tables Trilinear plots (TRIPLOT macro) r c tables Visualize association (SIEVE program) r c tables Visualize association (MOSAIC macro) Square r r tables Visualize agreement (AGREE program) n-way tables Fit loglinear models, visualize lack-of-fit (MOSAIC macro) Test & visualize partial association (MOSAIC macro) Visualize pairwise association (MOSMAT macro) Visualize conditional association (MOSMAT macro) Visualize loglinear structure (MOSMAT macro) Correspondence analysis and MCA (CORRESP macro) SPIDA Michael Friendly
23 2by2 2 2 tables 2 2 tables provide the simplest context to illustrate loglinear models and their relations to other tests and modeling techniques. Example: Bickel et al. (1975): admissions to graduate depatments at U. C. Berkeley in 1973 Aggregate data for the six largest departments: Table 4: Admissions to Berkeley graduate programs Admitted Rejected Total % Admitted Males Females Total Equivalent questions: Proportions: Are the proportions of males and females admitted the same? Method: Test of independent proportions, H 0 : p 1 = p 2 Answer: z = ˆp 1 ˆp 2 = p(1 p)(1/n1 +1/n 2 ) = (.6163)(1/2691+1/1835) SPIDA Michael Friendly
24 2by2 2 2 tables Odds: Are the odds of admission the same for males and females? Odds(Admit Male) Method: Odds ratio, H 0 : θ = Odds(Admit = 1 Female) Odds(Admit Male) Answer: θ = Odds(Admit = 1198/1493 Female) 557/1276 = 1.84 Males 84% more likely to be admitted. Is admission independent of gender? Method: χ 2 test of independence, H 0 : p ij = p i p j Answer: χ 2 (1) = (n ij m ij ) 2 m ij = 92.21, p < Does cell frequency depend only on marginal frequencies of gender and admission? Method: Loglinear model, log(m ij ) = µ + λ A i + λ G j + λag ij Answer: H 0 : λ AG ij = 0 LR G 2 (1) = 93.45, p < SPIDA Michael Friendly
25 2by2 Standard analysis: PROC FREQ proc freq data=berkeley; weight freq; tables gender*admit / chisq; Output: Statistics for Table of gender by admit Statistic DF Value Prob Chi-Square <.0001 Likelihood Ratio Chi-Square <.0001 Continuity Adj. Chi-Square <.0001 Mantel-Haenszel Chi-Square <.0001 Phi Coefficient How to visualize and interpret? SPIDA Michael Friendly
26 2by2 Fourfold displays for 2 2 tables Quarter circles: radius n ij area frequency Independence: Adjoining quadrants align Odds ratio: ratio of areas of diagonally opposite cells Confidence rings: Visual test of H 0 : θ = 1 adjoining rings overlap Sex: Male Admit?: Yes Admit?: No Sex: Female Confidence rings do not overlap: θ 1 SPIDA Michael Friendly
27 2by2 Fourfold displays for 2 2 k tables Data in Table 4 had been pooled over departments Stratified analysis: one fourfold display for each department Each 2 2 table standardized to equate marginal frequencies Shading: highlight departments for which H a : θ i 1 Department: A Sex: Male Department: B Sex: Male Department: C Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Department: D Sex: Male Sex: Female Department: E Sex: Male Sex: Female Department: F Sex: Male Admit?: Yes Admit?: No Admit?: Yes Admit?: No Admit?: Yes Admit?: No Sex: Female Sex: Female Sex: Female Only one department (A) shows association; θ A = women (0.349) 1 = 2.86 times as likely as men to be admitted. SPIDA Michael Friendly
28 2by2 What happened here? Simpson s paradox: Aggregate data are misleading because they falsely assume men and women apply equally in each field. But: Large differences in admission rates across departments. Men and women apply to these departments differentially. Women applied in large numbers to departments with low admission rates. (This ignores possibility of structural bias against women: differential funding of fields to which women are more likely to apply.) Other graphical methods can show these effects. SPIDA Michael Friendly
29 2by2 The FOURFOLD program and the FFOLD macro The FOURFOLD program is written in SAS/IML. The FFOLD macro provides a simpler interface. Printed output: (a) significance tests for individual odds ratios, (b) tests of homogeneity of association (here, over departments) and (c) conditional association (controlling for department). Plot by department: 1 %include catdata(berkeley); 2 berk4f.sas 3 %ffold(data=berkeley, 4 var=admit Gender, /* panel variables */ 5 by=dept, /* stratify by dept */ 6 down=2, across=3, /* panel arrangement */ 7 htext=2); /* font size */ Aggregate data: first sum over departments, using the TABLE macro: 8 %table(data=berkeley, out=berk2, 9 var=admit Gender, /* omit dept */ 10 weight=count, /* frequency variable */ 11 order=data); 12 %ffold(data=berk2, var=admit Gender); SPIDA Michael Friendly
30 2by2 The FOURFOLD program and the FFOLD macro Printed output: Odds (Admitted Male) / (Admitted Female) Odds Ratio Log Odds SE(LogOdds) Z Pr> Z A B C D E F SPIDA Michael Friendly
31 2by2 For a 2 2 k table, additional tests are also of interest: Homogeneity of odds ratios: H 0 : θ 1 = θ 2 = θ k (This is equivalent to the loglinear model of no 3-way association, [AG][AD][DG].) Conditional independence: Admit Gender Dept Test of Homogeneity of Odds Ratios (no 3-Way Association) TEST CHISQ DF PROB Homogeneity of Odds Ratios Conditional Independence of Admit and Gender Dept (assuming Homogeneity) TEST CHISQ DF PROB Likelihood-Ratio Cochran-Mantel-Haenszel SPIDA Michael Friendly
32 twoway Two-way frequency tables Table 5: Hair-color eye-color data Eye Hair Color Color Black Brown Red Blond Total Green Hazel Blue Brown Total SPIDA Michael Friendly
33 twoway Two-way frequency tables: PROC FREQ proc freq data=haireye; weight count; tables Eye * HAIR / chisq; Output: Statistics for Table of Eye by HAIR Statistic DF Value Prob Chi-Square <.0001 Likelihood Ratio Chi-Square <.0001 Mantel-Haenszel Chi-Square <.0001 Phi Coefficient SPIDA Michael Friendly
34 twoway Two-way frequency tables: PROC GENMOD We can fit the same model as a Poisson GLM for count: proc genmod data=haireye; class hair eye; model count = hair eye hair*eye / dist=poisson type3; Output: LR Statistics For Type 3 Analysis Chi- Source DF Square Pr > ChiSq hair <.0001 eye <.0001 hair*eye <.0001 How to visualize and interpret? SPIDA Michael Friendly
35 twoway count area Two-way frequency tables: Sieve diagrams When row/col variables are independent, n ij n i+ n +j each cell can be represented as a rectangle, with area = height width frequency, n ij Green Expected frequencies: Hair Eye Color Data Hazel Eye Color Blue Brown Black 286 Brown Hair Color 71 Red 127 Blond 592 SPIDA Michael Friendly
36 twoway Sieve diagrams Height/width marginal frequencies, n i+, n +j Area expected frequency, n i+ n +j Shading observed frequency, n ij, color: sign(n ij ˆm ij ). Independence: Shown when density of shading is uniform. Green Hazel Eye Color Blue Brown Black Brown Red Blond Hair Color SPIDA Michael Friendly
37 twoway Sieve diagrams Effect ordering: Reorder rows/cols to make the pattern coherent Blue Eye Color Green Hazel Brown Black Brown Red Blond Hair Color SPIDA Michael Friendly
38 twoway Sieve diagrams Vision classification data for 7477 women Unaided distant vision data High Right Eye Grade 2 3 Low High 2 3 Left Eye Grade Low SPIDA Michael Friendly
39 nway2 Mosaic displays and Log-linear Models Hartigan and Kleiner (1981), Friendly (1994, 1999): Width one set of marginals, n i+ Height relative proportions of other variable, p j i = n ij /n i+ area frequency, n ij = n i+ p j i Model: (Gender)(Admit) Admitted Rejected Standardized residuals: <-4-4:-2-2:-0 0:2 2:4 >4 Male Female SPIDA Michael Friendly
40 nway2 Shading: Sign and magnitude of Pearson χ 2 residual, d ij = (n ij ˆm ij )/ ˆm ij (or L.R. G 2 ) Sign: negative in red; + positive in blue Magnitude: intensity of shading: d ij > 0, 2, 4,... Independence: Rows align, or cells are empty! E.g., aggregate Berkeley data, independence model: Model: (Gender)(Admit) Admitted Rejected Standardized residuals: <-4-4:-2-2:-0 0:2 2:4 >4 Male Female SPIDA Michael Friendly
41 nway2 Mosaic displays Departments Gender: Did departments differ in the total number of applicants? Did men and women apply differentially to departments? Model: (Dept)(Gender) A B C D E F Model [Dept] [Gender]: G 2 (5) = Note: Departments ordered A F by overall rate of admission. Male Female SPIDA Michael Friendly
42 nway2 Mosaic displays: Hair color and eye color Brown Hazel Green Blue Dark hair goes with dark eyes, light hair with light eyes Red hair, hazel eyes an exception? Effect ordering: Rows/cols permuted by CA Dimension Black Brown Red Blond SPIDA Michael Friendly
43 nway2 Mosaic displays for multiway tables Generalizes to n-way tables: divide cells recursively Can fit any log-linear model (e.g., 3-way), Table 6: Log-linear Models for Three-Way Tables Model Model symbol Independence interpretation Mutual independence [A][B][C] A B C Joint independence [AB][C] (A B) C Conditional independence [AC][BC] (A B) C All two-way associations [AB][AC][BC] (none) Saturated model [ABC] (none) e.g., the model for conditional independence (A C B): [AB][BC] log m ijk = µ + λ A i + λ B j + λ C k + λ AB ij + λ BC jk Each mosaics shows: DATA (size of tiles) (some) marginal frequencies (spacing visual grouping) RESIDUALS (shading) what associations have been omitted? SPIDA Michael Friendly
44 nway2 E.g., Joint independence, [DG][A] (null model, Admit as response) [G 2 (11) = 877.1]: Model: (DeptGender)(Admit) A B C D E F Admitted Male Rejected Female SPIDA Michael Friendly
45 nway2 Mosaic displays for multiway tables Visual fitting: Pattern of lack-of-fit (residuals) better model smaller residuals cleaning the mosaic better model empty cells best done interactively! Model: (DeptGender)(DeptAdmit) A B C D E F E.g., Add [Dept Admit] association Conditional independence: Fits poorly, overall (G 2 (6) = 21.74) But, only in Department A! Admitted Male Rejected Female SPIDA Michael Friendly
46 nway2 Sequential plots and models Mosaic for an n-way table hierarchical decomposition of association in a way analogous to sequential fitting in regression Joint cell probabilities are decomposed as p ijkl = First 2 terms mosaic for v 1 and v 2 {v 1 v 2 } {}}{ p i p j i p k ij p l ijk p n ijk }{{} {v 1 v 2 v 3 } First 3 terms mosaic for v 1, v 2 and v 3 Sequential models of joint independence additive decomposition of the total association, G 2 [v 1 ][v 2 ]...[v p ] (mutual independence), G 2 [v 1 ][v 2 ]...[v p ] = G2 [v 1 ][v 2 ] +G2 [v 1 v 2 ][v 3 ] +G2 [v 1 v 2 v 3 ][v 4 ] + +G2 [v 1...v p 1 ][v p ] SPIDA Michael Friendly
47 nway2 Sequential plots and models: Example Hair color x Eye color marginal table (ignoring Sex) (Hair)(Eye), G2 (9) = Brown Hazel Green Blue Black Brown Red Blond SPIDA Michael Friendly
48 nway2 Sequential plots and models: Example 3-way table, Joint Independence Model [Hair Eye] [Sex] (HairEye)(Sex), G2 (15) = Brown Hazel Green Blue M F Black Brown Red Blond SPIDA Michael Friendly
49 nway2 Sequential plots and models: Example 3-way table, Mutual Independence Model [Hair] [Eye] [Sex] (Hair)(Eye)(Sex), G2 (24) = Brown Hazel Green Blue M F Black Brown Red Blond SPIDA Michael Friendly
50 nway2 Sequential plots and models: Example (Hair)(Eye), G2 (9) = (HairEye)(Sex), G2 (15) = (Hair)(Eye)(Sex), G2 (24) = Brown Hazel Green Blue + Brown Hazel Green Blue = Brown Hazel Green Blue Black Brown Red Blond M F Black Brown Red Blond M F Black Brown Red Blond [Hair] [Eye] G 2 (9) = [Hair Eye] [Sex] G 2 (15) = [Hair] [Eye] [Sex] G 2 (24) = SPIDA Michael Friendly
51 nway2 Mosaic matrices Analog of scatterplot matrix for categorical data (Friendly, 1999) Shows all p(p 1) pairwise views in a coherent display Each pairwise mosaic shows bivariate (marginal) relation Fit: marginal independence Residuals: show marginal associations Direct visualization of the Burt matrix analyzed in multiple correspondence analysis for p categorical variables Hair Black Brown Red Blond Black Brown Red Blond Brown Haz Grn Blue Male Female Blue Grn Brown Haz Eye Grn Blue Brown Haz Black Brown Red Blond Male Female Male Female Male Female Sex Black Brown Red Blond Brown Haz Grn Blue SPIDA Michael Friendly
52 nway2 Hair Brown Haz Grn Blue Male Female Black Brown Red Blond Brown Haz Grn Blue Eye Male Female Black Brown Red Blond Brown Haz Grn Blue Male Female Male Female Sex SPIDA Michael Friendly Brown Haz Grn Blue Black Brown Red Blond Black Brown Red Blond
53 nway2 Berkeley data: Admit Male Female A B C D E F Admit Reject Male Female Gender A B C D E F Admit Reject Male Female A B C D A B C D E Dept SPIDA Michael Friendly E F F Male Female Admit Reject Admit Reject
54 nway2 Partial association, Partial mosaics Stratified analysis: How does the association between two (or more) variables vary over levels of other variables? Mosaic plots for the main variables show partial association at each level of the other variables. E.g., Hair color, Eye color BY Sex TABLES sex * hair * eye; Sex: Male Sex: Female Brown Hazel Green Blue Brown Hazel Green Blue Black Brown Red -3.3 Blond Black Brown Red Blond SPIDA Michael Friendly
55 nway2 Partial association, Partial mosaics Stratified analysis: For models of partial independence, A B at each level of (controlling for) C A B C k, partial G 2 s add to the overall G 2 for conditional independence, G 2 A B C = k G 2 A B C(k) Table 7: Partial and Overall conditional tests, Hair Eye Sex Model df G 2 p-value [Hair][Eye] Male [Hair][Eye] Female [Hair][Eye] Sex SPIDA Michael Friendly
56 nway2 Software for Mosaic Displays Demonstration web applet: Runs the current version of mosaics via a cgi-script Can run sample data, upload a data file, enter data in a form. Choose model fitting and display options (not all supported). SPIDA Michael Friendly
57 mossoft SPIDA Michael Friendly
58 mossoft SPIDA Michael Friendly
59 mossoft SPIDA Michael Friendly
60 mossoft Software for Mosaic Displays SAS software & documentation: Examples: Many in VCD and on web site SAS/IML modules: mosaics.sas SAS/IML program Enter frequency table directly in SAS/IML, or read from a SAS dataset. Most flexible: Select, collapse, reorder, re-label table levels using SAS/IML statements Specify structural 0s, fit specialized models (e.g., quasi-independence) Interface to models fit using PROC GENMOD SPIDA Michael Friendly
61 mossoft Software for Mosaic Displays Macro interface: mosaic macro, table macro, mosmat macro mosaic macro Easiest to use: Direct input from a SAS dataset No knowledge of SAS/IML required Reorder table variables; collapse, reorder table levels with table macro Convenient interface to partial mosaics (BY=) table macro Create frequency table from raw data Collapse, reorder table categories Re-code table categories using SAS formats, e.g., 1= Male 2= Female mosmat macro Mosaic matrices analog of scatterplot matrix (Friendly, 1999) SPIDA Michael Friendly
62 mossoft mosaic macro example: Berkeley data berkeley.sas 1 title Berkeley Admissions data ; 2 proc format; 3 value admit 1="Admitted" 0="Rejected" ; 4 value dept 1="A" 2="B" 3="C" 4="D" 5="E" 6="F"; 5 value $sex M = Male F = Female ; 6 data berkeley; 7 do dept = 1 to 6; 8 do gender = M, F ; 9 do admit = 1, 0; 10 input 11 output; 12 end; end; end; 13 /* -- Male -- - Female- */ 14 /* Admit Rej Admit Rej */ 15 datalines; /* Dept A */ /* B */ /* C */ /* D */ /* E */ /* F */ 22 ; SPIDA Michael Friendly
63 mossoft Data set berkeley: dept gender admit freq 1 M M F F M M F F M M F F M M F F M M F F M M F F SPIDA Michael Friendly
64 mossoft mosaic macro example: Berkeley data mosaic9m.sas 1 goptions hsize=7in vsize=7in; 2 3 %include catdata(berkeley); 4 5 *-- apply character formats to numeric table variables; 6 %table(data=berkeley, 7 var=admit Gender Dept, 8 weight=freq, 9 char=y, format=admit admit. gender $sex. dept dept., 10 order=data, out=berkeley); %mosaic(data=berkeley, 13 vorder=dept Gender Admit, /* reorder variables */ 14 plots=2:3, /* which plots? */ 15 fittype=joint, /* fit joint indep. */ 16 split=h V V, htext=3); /* options */ SPIDA Michael Friendly
65 mossoft mosaic macro example: Berkeley data Model: (Dept)(Gender) Model: (DeptGender)(Admit) A B C D E F A B C D E F Male Two-way, Dept. by Gender Female Admitted Rejected Male Female Three-way, Dept. by Gender by Admit SPIDA Michael Friendly
66 mossoft mosmat macro: Mosaic matrices mosmat9m.sas 1 %include catdata(berkeley); 2 %mosmat(data=berkeley, 3 vorder=admit Gender Dept, sort=no); Admit Admit Reject Admit Reject Male Female A B C D E F Male Female Gender Male Female Admit Reject A B C D E F F A B C D E A B C D E F Dept Admit Reject Male Female SPIDA Michael Friendly
67 mossoft Partial mosaics mospart3.sas 1 %include catdata(hairdat3s); 2 %gdispla(off); 3 %mosaic(data=haireye, 4 vorder=hair Eye Sex, by=sex, 5 htext=2, cellfill=dev); 6 %gdispla(on); 7 %panels(rows=1, cols=2); /* make 2 figs -> 1 */ Sex: Male Sex: Female Brown Hazel Green Blue Brown Hazel Green Blue Black Brown Red -3.3 Blond Black Brown Red Blond SPIDA Michael Friendly
68 corresp Correspondence analysis and MCA Correspondence analysis (CA): Analog of PCA for frequency data: account for maximum % of χ 2 in few (2-3) dimensions finds scores for row (x im ) and column (y jm ) categories on these dimensions uses Singular Value Decomposition of residuals from independence, d ij = (n ij m ij )/ m ij d ij n = M m=1 λ m x im y jm optimal scaling: each pair of scores, x im ) and column (y jm, have highest possible correlation (= λ m ). plots of the row (x im ) and column (y jm ) scores show associations MCA: Extends CA to n-way tables, but only uses bivariate associations (like mosaic matrix) SPIDA Michael Friendly
69 corresp Hair color, Eye color data: 0.5 * Eye color * HAIR COLOR RED Green Dim 2 (9.5%) 0.0 Brown BLACK Hazel BROWN Blue BLOND Dim 1 (89.4%) Interpretation: row/column points near each other are positively associated Dim 1: 89.4% of χ 2 (dark light) Dim 2: 9.5% of χ 2 (RED/Green vs. others) SPIDA Michael Friendly
70 corresp PROC CORRESP and the CORRESP macro Two forms of input dataset: dataset in contingency table form column variables are levels of one factor, observations (rows) are levels of the other. Obs Eye BLACK BROWN RED BLOND 1 Brown Blue Hazel Green Raw category responses (case form), or cell frequencies (frequency form), classified by 2 or more factors (e.g., output from PROC FREQ) Obs Eye HAIR Count 1 Brown BLACK 68 2 Brown BROWN Brown RED 26 4 Brown BLOND Green RED Green BLOND 16 SPIDA Michael Friendly
71 corresp PROC CORRESP PROC CORRESP and the CORRESP macro Handles 2-way CA, extensions to n-way tables, and MCA Many options for scaling row/column coordinates and output statistics OUTC= option output dataset for plotting (PROC CORRESP doesn t do plots itself) CORRESP macro Uses PROC CORRESP for analysis Produces labeled plots of the category points in either 2 or 3 dimensions Many graphic options; can equate axes automatically See: SPIDA Michael Friendly
72 corresp Example: Hair and Eye Color Input the data in contingency table form 1 data haireye; corresp2a.sas 2 input EYE $ BLACK BROWN RED BLOND ; 3 datalines; 4 Brown Blue Hazel Green ; SPIDA Michael Friendly
73 corresp Example: Hair and Eye Color Using PROC CORRESP directly labeled printer plot proc corresp data=haireye outc=coord short; id eye; /* row variable */ var black brown red blond; /* col variables */ proc plot data=coord vtoh=2; /* plot step */ plot dim2 * dim1 = * $eye / box haxis=by.1 vaxis=by.1; /* plot options */ Using the CORRESP macro labeled high-res plot %corresp (data=haireye, id=eye, /* row variable */ var=black brown red blond, /* col variables */ dimlab=dim); /* options */ SPIDA Michael Friendly
74 corresp Example: Hair and Eye Color Printed output: The Correspondence Analysis Procedure Inertia and Chi-Square Decomposition Singular Principal Chi- Values Inertias Squares Percents % ************************* % *** % (Degrees of Freedom = 9) Row Coordinates Dim1 Dim2 Brown Blue Hazel Green Column Coordinates Dim1 Dim2 BLACK BROWN RED BLOND SPIDA Michael Friendly
75 corresp Example: Hair and Eye Color Output dataset(selected variables): Obs _TYPE_ EYE DIM1 DIM2 1 INERTIA.. 2 OBS Brown OBS Blue OBS Hazel OBS Green VAR BLACK VAR BROWN VAR RED VAR BLOND Row and column points are distinguished by the _TYPE_ variable: OBS vs. VAR SPIDA Michael Friendly
76 corresp Example: Hair and Eye Color Graphic output from CORRESP macro: * Eye color * HAIR COLOR 0.5 RED Green Dim 2 (9.5%) 0.0 Brown BLACK Hazel BROWN Blue BLOND Dim 1 (89.4%) Top legend produced with Annotate data set and the INANNO= option to the CORRESP macro SPIDA Michael Friendly
77 corresp Multi-way tables Stacking approach: van der Heijden and de Leeuw (1985) three-way table, of size I J K can be sliced and stacked as a two-way table, of size (I J) K I x J x K table J (I J )x K table I J J K J The variables combined are treated interactively Each way of stacking corresponds to a loglinear model (I J) K [AB][C] I (J K) [A][BC] J (I K) [B][AC] K SPIDA Michael Friendly
78 corresp Multi-way tables: Stacking PROC CORRESP: Use TABLES statement and option CROSS=ROW or CROSS=COL. E.g., for model [A B] [C], proc corresp cross=row; tables A B, C; weight count; CORRESP macro: Can use / instead of, %corresp( options=cross=row, tables=a B/ C, weight count); SPIDA Michael Friendly
79 corresp Example: Suicide Rates Suicide rates in West Germany, by Age, Sex and Method of suicide Sex Age POISON GAS HANG DROWN GUN JUMP M M M M M F F F F F CA of the [Age Sex] by [Method] table: Shows associations between the Age-Sex combinations and Method Ignores association between Age and Sex SPIDA Michael Friendly
80 corresp Example: Suicide Rates suicide5.sas 1 %include catdata(suicide); 2 *-- equate axes!; 3 axis1 order=(-.7 to.7 by.7) length=6.5 in label=(a=90 r=0); 4 axis2 order=(-.7 to.7 by.7) length=6.5 in; 5 %corresp(data=suicide, weight=count, 6 tables=%str(age sex, method), 7 options=cross=row short, 8 vaxis=axis1, haxis=axis2); Output: Inertia and Chi-Square Decomposition Singular Principal Chi- Values Inertias Squares Percents % ************************* % ************** % ** % % (Degrees of Freedom = 45) SPIDA Michael Friendly
81 corresp 0.7 Gas * F * F * M Gun Dimension 2 (33.0%) 0.0 Poison * M * F * M Jump * F * F Hang * M Drown * M Dimension 1 (60.4%) SPIDA Michael Friendly
82 MALE FEMALE Gun Gas Hang Poison Jump Drown Loglinear Models for Categorical Data corresp Compare with mosaic display: Suicide data - Model (SexAge)(Method) >65 SPIDA Michael Friendly Standardized residuals: <-8-8:-4-4:-2-2:-0 0:2 2:4 4:8 >8
83 structured Ordered categories More structured tables Tables with ordered categories allow more parsimonious representations of association Can represent λ AB ij by a small number of parameters more focused and more powerful tests of lack of independence Square tables For square I I tables, where row and column variables have the same categories: Can ignore diagonal cells, where association is expected and test remaining association (quasi-independence) Can test whether association is symmetric around the diagonal cells. SPIDA Michael Friendly
84 structured Ordered categories Ordinal scores In many cases it may be reasonable to assign numeric scores, {a i } to an ordinal row variable and/or numeric scores, {b i } to an ordinal column variable. Typically, scores are equally spaced and sum to zero, {a i } = i (I + 1)/2, e.g., {a i } = { 1, 0, 1} for I=3. Linear-by-Linear (Uniform) Association: When both variables are ordinal, the simplest model posits that any association is linear in both variables. λ AB ij = γ a i b j This model adds one additional parameter to the independence model (γ = 0). It is similar to For integer scores, the local log odds ratio for any contiguous 2 2 table are all equal, log θ ij = γ This is a model of uniform association SPIDA Michael Friendly
85 structured Row Effects and Column Effects: When only one variable is assigned scores, we have the row effects model or the column effects model. E.g., in the row effects model, the row variable (A) is treated as nominal, while the column variable (B) is assigned ordered scores {b j }. log m ij = µ + λ A i + λ B j + α i b j where the row parameters, α i, are defined so they sum to zero. This model has (I 1) more parameters than the independence model. A Row Effects + Column Effects model allows both variables to be ordered, but not necessarily with linear scores. Fitting models for ordinal variables Create numeric variables for category scores PROC GENMOD: Use as quantitative variables in MODEL statement, but not listed as CLASS variables SPIDA Michael Friendly
86 structured Example: Mental impairment and parents SES Srole et al. (1978) Data on mental health status of 1600 young NYC residents in relation to parents SES. Mental health: Well, mild symptoms, moderate symptoms, Impaired SES: 1 (High) 6 (Low) Mental Parents SES health High Low 1: Well : Mild : Moderate : Impaired SPIDA Michael Friendly
87 structured Mental Impairment and SES: Independence Linear x Linear Mental Well Mild Moderate Impaired Mental Well Mild Moderate Impaired High Low SES High Low SES Visual assessment: Residuals from the independence model show an opposite-corner pattern. This is consistent with both: Linear linear model: equi-spaced scores for both Mental and SES Row effects model: equi-spaced scores for SES, ordered scores for Mental SPIDA Michael Friendly
88 structured Statistical assesment: Table 8: Mental health data: Goodness-of-fit statistics for ordinal loglinear models Model G 2 df Pr(> G 2 ) AIC AIC-best Independence Col effects (SES) Row effects (mental) Lin x Lin Both the Row Effects and Linear linear models are significantly better than the Independence model AIC indicates a slight preference for the Linear linear model In the Linear linear model, the estimate of the coefficient of a i b j is ˆγ = = log θ, so ˆθ = exp(0.0907) = each step down the SES scale increases the odds of being classified one step poorer in mental health by 9.5%. SPIDA Michael Friendly
89 structured Fitting these models with PROC GENMOD: mentgen2.sas 1 %include catdata(mental); 2 data mental; 3 set mental; 4 m_lin = mental; *-- copy m_lin and s_lin for; 5 s_lin = ses; *-- use non-class variables; 6 7 title Independence model ; 8 proc genmod data=mental; 9 class mental ses; 10 model count = mental ses / dist=poisson obstats residuals; 11 format mental mental. ses ses.; 12 ods output obstats=obstats; 13 %mosaic(data=obstats, vorder=mental SES, resid=stresdev, 14 title=mental Impairment and SES: Independence, split=h V); Row Effects model: mentgen2.sas 16 proc genmod data=mental; 17 class mental ses; 18 model count = mental ses mental*s_lin / dist=poisson obstats; Linear linear model: mentgen2.sas 21 proc genmod data=mental; 22 class mental ses; 23 model count = mental ses m_lin*s_lin / dist=poisson obstats; SPIDA Michael Friendly
90 square Square-ish tables Tables where two (or more) variables have the same category levels: Employment categories of related persons (mobility tables) Multiple measurements over time (panel studies; longitudinal data) Repeated measures on the same individuals under different conditions Related/repeated measures are rarely independent, but may have simpler forms than general association E.g., vision data: Left and right eye acuity grade for 7477 women Independence, G2(9)= Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly
91 square Quasi-Independence Related/repeated measures are rarely independent most observations often fall on diagonal cells. Quasi-independence ignores diagonals, tests independence in remaining cells (λ ij = 0 for i j). The model dedicates one parameter (δ i ) to each diagonal cell, fitting them exactly, where I( ) is the indicator function. log m ij = µ + λ A i + λ B j + δ i I(i = j) This model may be fit as a GLM by including indicator variables for each diagonal cell. DIAG 4 rows 4 cols SPIDA Michael Friendly
92 square Using PROC GENMOD mosaic10g.sas 1 title Quasi-independence model (women) ; 2 proc genmod data=women; 3 class RightEye LeftEye diag; 4 model Count = LeftEye RightEye diag / 5 dist=poisson link=log obstats residuals; 6 ods output obstats=obstats; 7 %mosaic(data=obstats, vorder=righteye LeftEye,...); Mosaic: Quasi-Independence, G2(5)=199.1 Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly
93 square Symmetry Tests whether the table is symmetric around the diagonal, i.e., m ij = m ji As a loglinear model, symmetry is log m ij = µ + λ A i + λ B j + λ AB ij, subject to the conditions λ A i = λ B j and λ AB ij = λ AB ji. This model may be fit as a GLM by including indicator variables with equal values for symmetric cells, and indicators for the diagonal cells (fit exactly) SYMMETRY 4 rows 4 cols) SPIDA Michael Friendly
94 square Using PROC GENMOD mosaic10g.sas 1 proc genmod data=women; 2 class symmetry; 3 model Count = symmetry / 4 dist=poisson link=log obstats residuals; 5 ods output obstats=obstats; 6 %mosaic(data=obstats, vorder=righteye LeftEye,...); Mosaic: Symmetry, G2(6)=19.25 Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly
95 square Quasi-Symmetry Symmetry is often too restrictive: equal marginal frequencies (λ A i = λ B i ) PROC GENMOD: Use the usual marginal effect parameters + symmetry: mosaic10g.sas 1 proc genmod data=women; 2 class LeftEye RightEye symmetry; 3 model Count = LeftEye RightEye symmetry / 4 dist=poisson link=log obstats residuals; 5 ods output obstats=obstats; Quasi-Symmetry, G2(3)=7.27 Low RightEye High 2 3 High 2 3 LeftEye Low SPIDA Michael Friendly
96 square Table 9: Summary of models fit to vision data Model G 2 df Pr(> G 2 ) AIC AIC - min(aic) Independence Linear*Linear Row+Column Effects Quasi-Independence Symmetry Quasi-Symmetry Ordinal Quasi-Symmetry Only the quasi-symmetry models provide an acceptable fit The ordinal quasi-symmetry model is most parsimonious SPIDA Michael Friendly
97 loglin References Bickel, P. J., Hammel, J. W., and O Connell, J. W. Sex bias in graduate admissions: Data from Berkeley. Science, 187: , Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, Friendly, M. Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89: , Friendly, M. Conceptual and visual models for categorical data. The American Statistician, 49: , Friendly, M. Extending mosaic displays: Marginal, conditional, and partial views of categorical data. Journal of Computational and Graphical Statistics, 8(3): , Friendly, M. Visualizing Categorical Data. SAS Institute, Cary, NC, Friendly, M. Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56(4): , Friendly, M. and Kwan, E. Effect ordering for data displays. Computational Statistics and Data Analysis, 43(4): , Hartigan, J. A. and Kleiner, B. Mosaics for contingency tables. In Eddy, W. F., editor, Computer Science and Statistics: Proceedings of the 13th Symposium on the Interface, pp Springer-Verlag, New York, NY, Srole, L., Langner, T. S., Michael, S. T., Kirkpatrick, P., Opler, M. K., and Rennie, T. A. C. Mental Health in the Metropolis: The Midtown Manhattan Study. NYU Press, New York, SPIDA Michael Friendly
98 loglin Tufte, E. R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, van der Heijden, P. G. M. and de Leeuw, J. Correspondence analysis used complementary to loglinear analysis. Psychometrika, 50: , SPIDA Michael Friendly
Applied Multivariate Analysis
Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017 Choosing Statistical Method 1 Choice an appropriate method 2 Cross-tabulation More advance analysis of frequency tables
More informationLoglinear and Logit Models for Contingency Tables
Chapter 8 Loglinear and Logit Models for Contingency Tables Loglinear models comprise another special case of generalized linear models designed for contingency tables of frequencies. They are most easily
More informationTwo-way contingency tables
Chapter 4 Two-way contingency tables The analysis of two-way frequency tables concerns the association between two variables. A variety of specialized graphical displays help to visualize the pattern of
More informationANALYSIS OF NOMINAL DATA MULTI-WAY CONTINGENCY TABLE
M A T H E M A T I C A L E C O N O M I C S No. 7(14) 2011 ANALYSIS OF NOMINAL DATA MULTI-WAY CONTINGENCY TABLE Abstract. Presented in this paper the method of graphical presentation of the relationship
More informationStatistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.
Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975. SPSS Statistics were designed INTRODUCTION TO SPSS Objective About the
More informationSpatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data
Spatial Patterns We will examine methods that are used to analyze patterns in two sorts of spatial data: Point Pattern Analysis - These methods concern themselves with the location information associated
More informationMichael Friendly. Categorical Data Analysis with Graphics
-5 0 5 Sqrt(frequency) 10 15 20 25 30 35 40 April, 2003 SUGI 28 York University, Toronto, Michael Friendly Black Brown Red Blond 0 2 4 6 8 10 12 Number of males Left Eye Grade High
More informationModelling Proportions and Count Data
Modelling Proportions and Count Data Rick White May 4, 2016 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:
More informationVersion 3.6. Michael Friendly Psychology Department York University
User s Guide for MOSAICS Version 3.6 Michael Friendly Psychology Department York University Contents 1 Introduction 1 2 Installation Guide 2 2.1 How to obtain MOSAICS.... 3 2.2 Installing MOSAICS.......
More informationAn introduction to SPSS
An introduction to SPSS To open the SPSS software using U of Iowa Virtual Desktop... Go to https://virtualdesktop.uiowa.edu and choose SPSS 24. Contents NOTE: Save data files in a drive that is accessible
More informationFrequency Tables. Chapter 500. Introduction. Frequency Tables. Types of Categorical Variables. Data Structure. Missing Values
Chapter 500 Introduction This procedure produces tables of frequency counts and percentages for categorical and continuous variables. This procedure serves as a summary reporting tool and is often used
More informationModelling Proportions and Count Data
Modelling Proportions and Count Data Rick White May 5, 2015 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:
More informationCategorical Data Analysis with Graphics. Michael Friendly. April, 2003 SUGI 28.
SUGI 8 Michael Friendly Color version of these slides: http://www.math.yorku.ca/scs/sugi/sugi8.pdf Logit plots, effect plots Diagnostic plots Logistic and logit regression Mosaic displays Mosaic matrices
More information2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationAND NUMERICAL SUMMARIES. Chapter 2
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative
More informationIntermediate SAS: Statistics
Intermediate SAS: Statistics OIT TSS 293-4444 oithelp@mail.wvu.edu oit.wvu.edu/training/classmat/sas/ Table of Contents Procedures... 2 Two-sample t-test:... 2 Paired differences t-test:... 2 Chi Square
More informationGeneral Factorial Models
In Chapter 8 in Oehlert STAT:5201 Week 9 - Lecture 2 1 / 34 It is possible to have many factors in a factorial experiment. In DDD we saw an example of a 3-factor study with ball size, height, and surface
More information8. MINITAB COMMANDS WEEK-BY-WEEK
8. MINITAB COMMANDS WEEK-BY-WEEK In this section of the Study Guide, we give brief information about the Minitab commands that are needed to apply the statistical methods in each week s study. They are
More informationSTATS PAD USER MANUAL
STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11
More informationStatistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.
Statistical Methods Instructor: Lingsong Zhang 1 Issues before Class Statistical Methods Lingsong Zhang Office: Math 544 Email: lingsong@purdue.edu Phone: 765-494-7913 Office Hour: Monday 1:00 pm - 2:00
More informationUNIT 4. Research Methods in Business
UNIT 4 Preparing Data for Analysis:- After data are obtained through questionnaires, interviews, observation or through secondary sources, they need to be edited. The blank responses, if any have to be
More informationSAS/STAT 13.1 User s Guide. The NESTED Procedure
SAS/STAT 13.1 User s Guide The NESTED Procedure This document is an individual chapter from SAS/STAT 13.1 User s Guide. The correct bibliographic citation for the complete manual is as follows: SAS Institute
More informationGeneral Factorial Models
In Chapter 8 in Oehlert STAT:5201 Week 9 - Lecture 1 1 / 31 It is possible to have many factors in a factorial experiment. We saw some three-way factorials earlier in the DDD book (HW 1 with 3 factors:
More informationResearch Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel
Research Methods for Business and Management Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel A Simple Example- Gym Purpose of Questionnaire- to determine the participants involvement
More information4 Displaying Multiway Tables
4 Displaying Multiway Tables An important subset of statistical data comes in the form of tables. Tables usually record the frequency or proportion of observations that fall into a particular category
More informationTable of Contents (As covered from textbook)
Table of Contents (As covered from textbook) Ch 1 Data and Decisions Ch 2 Displaying and Describing Categorical Data Ch 3 Displaying and Describing Quantitative Data Ch 4 Correlation and Linear Regression
More informationChapter Two: Descriptive Methods 1/50
Chapter Two: Descriptive Methods 1/50 2.1 Introduction 2/50 2.1 Introduction We previously said that descriptive statistics is made up of various techniques used to summarize the information contained
More informationStrategies for Modeling Two Categorical Variables with Multiple Category Choices
003 Joint Statistical Meetings - Section on Survey Research Methods Strategies for Modeling Two Categorical Variables with Multiple Category Choices Christopher R. Bilder Department of Statistics, University
More informationPOLS 205 Political Science as a Social Science. Analyzing Tables of Data
POLS 205 Political Science as a Social Science Analyzing Tables of Data Christopher Adolph University of Washington, Seattle May 17, 2010 Chris Adolph (UW) Tabular Data May 17, 2010 1 / 67 Review Inference
More informationECLT 5810 Data Preprocessing. Prof. Wai Lam
ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
More informationPoints Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked
Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations
More informationThe NESTED Procedure (Chapter)
SAS/STAT 9.3 User s Guide The NESTED Procedure (Chapter) SAS Documentation This document is an individual chapter from SAS/STAT 9.3 User s Guide. The correct bibliographic citation for the complete manual
More informationGRAPHICAL METHODS FOR CATEGORICAL DATA. Michael Friendly, York University. Seive diagrams
GRAPHCAL METHODS FOR CATEGORCAL DATA Michael Friendly York University Abstract Statistical methods for categorical data such as loglinear models and logistic regression represent discrete analogs of the
More informationARTIFICIAL INTELLIGENCE (CS 370D)
Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) (CHAPTER-18) LEARNING FROM EXAMPLES DECISION TREES Outline 1- Introduction 2- know your data 3- Classification
More informationUnit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users
BIOSTATS 640 Spring 2018 Review of Introductory Biostatistics STATA solutions Page 1 of 13 Key Comments begin with an * Commands are in bold black I edited the output so that it appears here in blue Unit
More informationLab #9: ANOVA and TUKEY tests
Lab #9: ANOVA and TUKEY tests Objectives: 1. Column manipulation in SAS 2. Analysis of variance 3. Tukey test 4. Least Significant Difference test 5. Analysis of variance with PROC GLM 6. Levene test for
More informationEnterprise Miner Tutorial Notes 2 1
Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender
More informationExcel 2010 with XLSTAT
Excel 2010 with XLSTAT J E N N I F E R LE W I S PR I E S T L E Y, PH.D. Introduction to Excel 2010 with XLSTAT The layout for Excel 2010 is slightly different from the layout for Excel 2007. However, with
More informationFeature Selection Using Modified-MCA Based Scoring Metric for Classification
2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification
More informationInference for loglinear models (contd):
Stat 504, Lecture 25 1 Inference for loglinear models (contd): Loglinear/Logit connection Intro to Graphical Models Stat 504, Lecture 25 2 Loglinear Models no distinction between response and explanatory
More informationNCSS Statistical Software
Chapter 327 Geometric Regression Introduction Geometric regression is a special case of negative binomial regression in which the dispersion parameter is set to one. It is similar to regular multiple regression
More information186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95
A Statistical Analysis Macro Library in SAS Carl R. Haske, Ph.D., STATPROBE, nc., Ann Arbor, M Vivienne Ward, M.S., STATPROBE, nc., Ann Arbor, M ABSTRACT Statistical analysis plays a major role in pharmaceutical
More informationRegression III: Advanced Methods
Lecture 3: Distributions Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture Examine data in graphical form Graphs for looking at univariate distributions
More informationSummarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester
Summarising Data Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 09/10/2018 Summarising Data Today we will consider Different types of data Appropriate ways to summarise these
More informationIntroduction to Mixed Models: Multivariate Regression
Introduction to Mixed Models: Multivariate Regression EPSY 905: Multivariate Analysis Spring 2016 Lecture #9 March 30, 2016 EPSY 905: Multivariate Regression via Path Analysis Today s Lecture Multivariate
More informationMixed Effects Models. Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC.
Mixed Effects Models Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC March 6, 2018 Resources for statistical assistance Department of Statistics
More informationbook 2014/5/6 15:21 page v #3 List of figures List of tables Preface to the second edition Preface to the first edition
book 2014/5/6 15:21 page v #3 Contents List of figures List of tables Preface to the second edition Preface to the first edition xvii xix xxi xxiii 1 Data input and output 1 1.1 Input........................................
More informationTHIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010
THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE
More informationHierarchical Clustering / Dendrograms
Chapter 445 Hierarchical Clustering / Dendrograms Introduction The agglomerative hierarchical clustering algorithms available in this program module build a cluster hierarchy that is commonly displayed
More informationBox-Cox Transformation for Simple Linear Regression
Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are
More informationCS 4460 Intro. to Information Visualization Sep. 18, 2017 John Stasko
Multivariate Visual Representations 1 CS 4460 Intro. to Information Visualization Sep. 18, 2017 John Stasko Learning Objectives For the following visualization techniques/systems, be able to describe each
More informationFrequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM
Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM * Which directories are used for input files and output files? See menu-item "Options" and page 22 in the manual.
More informationGraphical Presentation for Statistical Data (Relevant to AAT Examination Paper 4: Business Economics and Financial Mathematics) Introduction
Graphical Presentation for Statistical Data (Relevant to AAT Examination Paper 4: Business Economics and Financial Mathematics) Y O Lam, SCOPE, City University of Hong Kong Introduction The most convenient
More informationSubset Selection in Multiple Regression
Chapter 307 Subset Selection in Multiple Regression Introduction Multiple regression analysis is documented in Chapter 305 Multiple Regression, so that information will not be repeated here. Refer to that
More informationSAS data statements and data: /*Factor A: angle Factor B: geometry Factor C: speed*/
STAT:5201 Applied Statistic II (Factorial with 3 factors as 2 3 design) Three-way ANOVA (Factorial with three factors) with replication Factor A: angle (low=0/high=1) Factor B: geometry (shape A=0/shape
More informationPreprocessing Short Lecture Notes cse352. Professor Anita Wasilewska
Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept
More information- 1 - Fig. A5.1 Missing value analysis dialog box
WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationCS 237: Probability in Computing
CS 237: Probability in Computing Wayne Snyder Computer Science Department Boston University Lecture 25: Logistic Regression Motivation: Why Logistic Regression? Sigmoid functions the logit transformation
More informationPredict Outcomes and Reveal Relationships in Categorical Data
PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,
More informationWant to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research
Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research Liping Huang, Center for Home Care Policy and Research, Visiting Nurse Service of New York, NY, NY ABSTRACT The
More informationZero-Inflated Poisson Regression
Chapter 329 Zero-Inflated Poisson Regression Introduction The zero-inflated Poisson (ZIP) regression is used for count data that exhibit overdispersion and excess zeros. The data distribution combines
More informationRaw Data is data before it has been arranged in a useful manner or analyzed using statistical techniques.
Section 2.1 - Introduction Graphs are commonly used to organize, summarize, and analyze collections of data. Using a graph to visually present a data set makes it easy to comprehend and to describe the
More informationInteractive Math Glossary Terms and Definitions
Terms and Definitions Absolute Value the magnitude of a number, or the distance from 0 on a real number line Addend any number or quantity being added addend + addend = sum Additive Property of Area the
More informationHierarchical Generalized Linear Models
Generalized Multilevel Linear Models Introduction to Multilevel Models Workshop University of Georgia: Institute for Interdisciplinary Research in Education and Human Development 07 Generalized Multilevel
More informationAnalysis of Complex Survey Data with SAS
ABSTRACT Analysis of Complex Survey Data with SAS Christine R. Wells, Ph.D., UCLA, Los Angeles, CA The differences between data collected via a complex sampling design and data collected via other methods
More informationBrief Guide on Using SPSS 10.0
Brief Guide on Using SPSS 10.0 (Use student data, 22 cases, studentp.dat in Dr. Chang s Data Directory Page) (Page address: http://www.cis.ysu.edu/~chang/stat/) I. Processing File and Data To open a new
More informationNOTES TO CONSIDER BEFORE ATTEMPTING EX 1A TYPES OF DATA
NOTES TO CONSIDER BEFORE ATTEMPTING EX 1A TYPES OF DATA Statistics is concerned with scientific methods of collecting, recording, organising, summarising, presenting and analysing data from which future
More informationUsing Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers
Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being
More informationLESSON 1: INTRODUCTION TO COUNTING
LESSON 1: INTRODUCTION TO COUNTING Counting problems usually refer to problems whose question begins with How many. Some of these problems may be very simple, others quite difficult. Throughout this course
More informationDownloaded from
UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making
More informationRight-click on whatever it is you are trying to change Get help about the screen you are on Help Help Get help interpreting a table
Q Cheat Sheets What to do when you cannot figure out how to use Q What to do when the data looks wrong Right-click on whatever it is you are trying to change Get help about the screen you are on Help Help
More informationFrequency Distributions
Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,
More informationIntroducing Categorical Data/Variables (pp )
Notation: Means pencil-and-paper QUIZ Means coding QUIZ Definition: Feature Engineering (FE) = the process of transforming the data to an optimal representation for a given application. Scaling (see Chs.
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and
More informationLecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- #
Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series by Mario F. Triola Chapter 2 Summarizing and Graphing Data 2-1 Review and Preview 2-2 Frequency Distributions 2-3 Histograms
More informationCorrectly Compute Complex Samples Statistics
SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample
More informationUser Services Spring 2008 OBJECTIVES Introduction Getting Help Instructors
User Services Spring 2008 OBJECTIVES Use the Data Editor of SPSS 15.0 to to import data. Recode existing variables and compute new variables Use SPSS utilities and options Conduct basic statistical tests.
More informationSAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure
SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure This document is an individual chapter from SAS/STAT 14.3 User s Guide. The correct bibliographic citation for this manual is as follows: SAS Institute
More informationBiostat Methods STAT 5820/6910 Handout #4: Chi-square, Fisher s, and McNemar s Tests
Biostat Methods STAT 5820/6910 Handout #4: Chi-square, Fisher s, and McNemar s Tests Example 1: 152 patients were randomly assigned to 4 dose groups in a clinical study. During the course of the study,
More informationMean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242
Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationThe basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student
Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite
More informationSTA 570 Spring Lecture 5 Tuesday, Feb 1
STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row
More informationIQR = number. summary: largest. = 2. Upper half: Q3 =
Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationAdvances in Visualizing Categorical Data Using the vcd, gnm and vcdextra Packages in R
intro vcd gnm 3D odds References Advances in Visualizing Categorical Data Using the vcd, gnm and vcdextra Packages in R Michael Friendly 1 Heather Turner 2 David Firth 2 Achim Zeileis 3 1 Psychology Department
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationSection 4 General Factorial Tutorials
Section 4 General Factorial Tutorials General Factorial Part One: Categorical Introduction Design-Ease software version 6 offers a General Factorial option on the Factorial tab. If you completed the One
More information1 RefresheR. Figure 1.1: Soy ice cream flavor preferences
1 RefresheR Figure 1.1: Soy ice cream flavor preferences 2 The Shape of Data Figure 2.1: Frequency distribution of number of carburetors in mtcars dataset Figure 2.2: Daily temperature measurements from
More informationLearner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display
CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &
More informationFrequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS
ABSTRACT Paper 1938-2018 Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS Robert M. Lucas, Robert M. Lucas Consulting, Fort Collins, CO, USA There is confusion
More informationMATH 117 Statistical Methods for Management I Chapter Two
Jubail University College MATH 117 Statistical Methods for Management I Chapter Two There are a wide variety of ways to summarize, organize, and present data: I. Tables 1. Distribution Table (Categorical
More informationHYPERVARIATE DATA VISUALIZATION
HYPERVARIATE DATA VISUALIZATION Prof. Rahul C. Basole CS/MGT 8803-DV > January 25, 2017 Agenda Hypervariate Data Project Elevator Pitch Hypervariate Data (n > 3) Many well-known visualization techniques
More informationD-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview
Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,
More informationMINITAB 17 BASICS REFERENCE GUIDE
MINITAB 17 BASICS REFERENCE GUIDE Dr. Nancy Pfenning September 2013 After starting MINITAB, you'll see a Session window above and a worksheet below. The Session window displays non-graphical output such
More informationBar Charts and Frequency Distributions
Bar Charts and Frequency Distributions Use to display the distribution of categorical (nominal or ordinal) variables. For the continuous (numeric) variables, see the page Histograms, Descriptive Stats
More informationMedoid Partitioning. Chapter 447. Introduction. Dissimilarities. Types of Cluster Variables. Interval Variables. Ordinal Variables.
Chapter 447 Introduction The objective of cluster analysis is to partition a set of objects into two or more clusters such that objects within a cluster are similar and objects in different clusters are
More informationMosaic displays for n-way tables
Chapter 5 Mosaic displays for n-way tables Mosaic displays help to visualize the pattern of associations among variables in two-way and larger tables. Extensions of this technique can reveal partial associations,
More information