Categorical Data Analysis with Graphics. Michael Friendly. April, 2003 SUGI 28.

Size: px

Start display at page:

Download "Categorical Data Analysis with Graphics. Michael Friendly. April, 2003 SUGI 28."

Avis Wheeler
5 years ago
Views:

1 SUGI 8 Michael Friendly Color version of these slides: Logit plots, effect plots Diagnostic plots Logistic and logit regression Mosaic displays Mosaic matrices Mosaic displays and loglinear models for n-way tables Fourfold displays Sieve diagrams Methods for two-way frequency tables Hanging rootograms Robust distribution plots Methods for discrete distributions Overview: Categorical Data and Graphics Outline April, SUGI 8 York University, Toronto, <friendly@yorku.ca> Michael Friendly Black Brown Red Blond 6 8 Number of males Left Eye Grade -5 High Low -5.9 Low -. Sqrt(frequency) Right Eye Grade Unaided distant vision data High Brown Hazel Green Blue Categorical Data Analysis Methods of analysis for categorical data fall into two main categories: Non-parametric, randomization-based methods make minimal assumptions useful for hypothesis-testing SAS: PROC FREQ Pearson Chi-square Fisher s exact test (for small expected frequencies) Mantel-Haenszel tests (ordered categories: test for linear association) Model-based methods Must assume random sample (possibly stratified) Useful for estimation purposes Greater flexibility; fitting specialized models (e.g., symmetry) More suitable for multi-way tables SAS: PROC LOGISTIC, CATMOD, GENMOD, INSIGHT (Fit YX) estimate standard errors, covariances for model parameters confidence intervals for parameters, predicted Pr{response} SUGI 8 Michael Friendly Graphical Methods for Categorical Data If I can t picture it, I can t understand it. Albert Einstein Getting information from a table is like extracting sunlight from a cucumber. Farquhar & Farquhar, 89 Tables vs. Graphs Tables are best suited for look-up read off exact numbers Graphs are better for showing patterns, trends, anomalies, making comparisons Visual presentation as communication: what do you want to say? SUGI 8 Michael Friendly

2 Auto data: PC/ order 8 6 Principles of Graphical Displays Effect ordering (Friendly and Kwan, ) In tables and graphs, sort unordered factors according to the effects you want to see/show. Auto data: Alpha order Corrgrams: Exploratory displays for correlation matrices (Friendly, ) SUGI 8 Michael Friendly Effect ordering and high-lighting for tables (Friendly, ) Table : Hair color - Eye color data: Effect ordered Hair color Eye color Black Brown Red Blond Brown Hazel 5 5 Green Blue Table : Hair color - Eye color data: Alpha ordered Hair color Eye color Blond Black Brown Red Blue Brown Green 5 5 Hazel Model: Independence: [Hair][Eye] χ (9)= 8.9 Color coding: <- <- <- > > > n in each cell: n < expected n > expected SUGI 8 5 Michael Friendly Displa Gratio Hroom Length MPG Price Rep77 Rep78 Rseat Trunk Turn Weight Gratio MPG Rep78 Rep77 Price Hroom Trunk Rseat Length Weight Displa Turn Turn Displa Weight Length Weight Rseat Trunk Hroom Price Rep77 Rep78 Turn Trunk Rseat Rep78 Rep77 Price MPG Gratio MPG Length Hroom Gratio Displa SUGI 8 7 Michael Friendly Admit Reject Dept Admit Reject Gender Admit Admit Reject Admit Reject Small multiples (Tufte, 98) combine stratified graphs into coherent displays Comparisons Make visual comparisons easy SUGI 8 6 Michael Friendly 5 6 Number of Occurrences 5 6 Number of Occurrences Frequency 75 Sqrt(frequency) Visual grouping connect with lines, make key comparisons contiguous Baselines compare data to model against a line, preferably horizontal Comparisons Make visual comparisons easy

3 SUGI 8 9 Michael Friendly VCD and SSSG - Make these methods available and accessible in SAS Practical power = Statistical power Probability of Use Today s goal: take-home knowledge Tomorrow s goal: dynamic, interactive graphics for categorical data Goals Diagnostic plots - influence, violation of assumptions Effect plots - estimated probabilities of response or log odds Residual plots - departures from model, omitted terms,... Plots for model-based methods Help detect patterns, trends, anomalies, suggest hypotheses Show the data, not just summaries Minimal assumptions (like non-parametric methods) Exploratory methods SUGI 8 8 Michael Friendly Fourfold display for table Mosaic plot for -way table Sex: Female Admitted Rejected Admit?: Yes Sex: Male Admit?: No Model: (DeptGender)(Admit) Frequency data: count area Quantitative data: magnitude position along an axis Visual metaphors VCD Macros & SAS/IML programs Macros, datasets available at Discrete distributions DISTPLOT Plots for discrete distributions GOODFIT Goodness-of-fit for discrete distributions ORDPLOT Ord plot for discrete distributions POISPLOT Poissonness plot ROOTGRAM Hanging rootograms Two-way and n-way tables AGREE Observer agreement chart CORRESP Plot PROC CORRESP results FFOLD Fourfold displays for k tables (macro) FOURFOLD Fourfold displays for k tables (SAS/IML) SIEVE Sieve diagrams (SAS/IML) MOSAIC Mosaic displays (macro) MOSAICS SAS/IML modules for mosaic displays MOSMAT Mosaic matrices (macro) TABLE Construct a grouped frequency table, with recoding TRIPLOT Trilinear plots for n tables SUGI 8 Michael Friendly Model-based methods ADDVAR Added variable plots for logistic regression CATPLOT Plot results from PROC CATMOD HALFNORM Half-normal plots for generalized linear models INFLGLIM Influence plots for generalized linear models INFLOGIS Influence plots for logistic regression LOGODDS Plot empirical logits and probabilities for binary data POWERLOG Power calculations for logistic regression POWERRxC Power calculations for two-way frequency table POWERx Power calculations for a table ROBUST Robust fitting for linear models Utility macros DUMMY Create dummy variables LAGS Calculate lagged frequencies for sequential analysis PANELS Arrange multiple plots in a panelled display SORT Sort a dataset by the value of a statistic or formatted value Utility Graphics utility macros: BARS, EQUATE, GDISPLA, GENSYM, GSKIP, LABEL, POINTS, PSCALE VCD Archive (vcdprog.zip) available to purchasers at: support.sas.com/publishing/bbu/5657_sample.html SUGI 8 Michael Friendly

4 Discrete distributions Counts of occurrences: accidents, words in text, blood cells with some characteristic. Data: Basic outcome value, k, k =,,..., and number of observations, nk, with that value. Example: distributions of key marker words: from, may, whilst,... in Federalist Papers by James Madison, e.g., blocks of words with may: Occurrences (k) 5 6 Blocks (nk) Example: Saxony families with children having k =,,... sons. k nk SUGI 8 Michael Friendly Discrete distributions Questions: What process gave rise to the distribution? Form of distribution: uniform, binomial, Poisson, negative binomial, geometric, etc.? Estimate parameters Visualize goodness of fit For example: Federalist Papers: might expect a Poisson(λ) distribution. Families in Saxony: might expect a Bin(n, p) distribution with n =. Perhaps p =.5 as well. SUGI 8 Michael Friendly Fitting and graphing discrete distributions VCD methods to fit, visualize, and diagnose discrete distributions: Fitting: GOODFIT macro fits uniform, binomial, Poisson, negative binomial, geometric, logarithmic series distributions (or any specified multinomial) Hanging rootograms: Sensitively assess departure between Observed, Fitted counts (ROOTGRAM macro) Ord plots: Diagnose form of a discrete distribution (ORDPLOT macro) Poissonness plots: Robust fitting and diagnostic plots for Poisson (POISPLOT macro) Robust distribution plots (DISTPLOT macro) SUGI 8 Michael Friendly GOODFIT macro: Fitting discrete distributions GOODFIT macro fits uniform, binomial, Poisson, negative binomial, geometric, logarithmic series distributions (or any specified multinomial) E.g., Try fitting Poisson model madfit.sas title "Instances of may in Federalist papers"; data madison; input count blocks; label count= Number of Occurrences 5 blocks= Blocks of Text ; 6 datalines; ; 5 %goodfit(data=madison, var=count, freq=blocks, 6 dist=poisson); SUGI 8 5 Michael Friendly

5 Fitting discrete distributions The GOODFIT macro gives a table of observed and fitted frequencies, Pearson χ residuals (CHI) and likelihood-ratio deviance residuals (DEV). Instances of may in Federalist papers COUNT BLOCKS PHAT EXP CHI DEV ====== ======= ======= SUGI 8 6 Michael Friendly Fitting discrete distributions In addition, it provides the overall goodness-of-fit tests: Goodness-of-fit test for data set MADISON Analysis variable: COUNT Number of Occurrences Distribution: POISSON Estimated Parameters: lambda =.6565 Pearson chi-square = Prob > chi-square = Likelihood ratio G = 5. Prob > chi-square =.55 Degrees of freedom = 5 The poisson model does not fit! Why? SUGI 8 7 Michael Friendly SUGI 8 9 Michael Friendly 5 6 Number of Occurrences - Sqrt(frequency) 8 6 %goodfit(data=madison, var=count, freq=blocks, dist=poisson, out=fit); %rootgram(data=fit, var=count, obs=blocks); shift histogram bars to the fitted curve judge deviations vs. horizontal line. plot freq smaller frequencies are emphasized. Tukey (97, 977): Hang & root them Hanging rootograms SUGI 8 8 Michael Friendly 5 6 Number of Occurrences Frequency 6 8 largest frequencies dominate display must assess deviations vs. a curve Problems: 6 %goodfit(data=madison, var=count, freq=blocks, dist=poisson); Discrete distributions often graphed as histograms, with a theoretical fitted distribution superimposed. What s wrong with histograms?

6 SUGI 8 Michael Friendly 5 6 Occurrences of may Frequency Ratio, (k n(k) / n(k-)) 5 slope =. intercept=-. type: Negative binomial parm: p = Diagnoses distribution as NegBin Estimates ˆp =.576 ORDPLOT macro %ordplot(data=madison, count=count, freq=blocks); Ord plots SUGI 8 Michael Friendly Fit line by WLS, using nk as weights Slope Intercept Distribution Parameter (b) (a) (parameter) estimate + Poisson (λ) λ = a + Binomial (n, p) p = b/(b ) + + Neg. binomial (n,p) p = b + Log. series (θ) θ = b θ = a plot of kpk/pk against k is linear signs of intercept and slope determine the form, give rough estimates of parameters Ord (967): for each of Poisson, Binomial, Negative Binomial, and Logarithmic Series distributions, How to tell which discrete distributions are likely candidates? Ord plots: Diagnose form of discrete distribution SUGI 8 Michael Friendly Robust plots for Poisson distribution (Hoaglin and Tukey, 985) For Poisson, plot count metameter = φ (nk) = log e (k! nk/n) vs. k Linear relation Poisson, slope gives ˆλ CI for points, diagnostic (influence) plot POISPLOT macro Ord plots lack robustness one discrepant freqency, nk affects points for both k and k + Robust distribution plots: Poisson SUGI 8 Michael Friendly Logarithmic series Binomial Number collected Number of males Frequency Ratio, (k n(k) / n(k-)) Frequency Ratio, (k n(k) / n(k-)) slope = intercept=.96 type: Binomial parm: p = slope =.6 intercept=-.79 type: Logarithmic series parm: theta =.6 Ord plot: Families in Saxony Butterfly species collected in Malaya Other distributions: Ord plots

7 SUGI 8 5 Michael Friendly Poissonness plot Influence plot for change in λ Number of Occurrences Scaled Leverage Poisson metameter, ln(k! n(k) / N) Parameter change lambda: mean =.656 exp(slope) =.56. slope =.8 intercept= Instances of may in Federalist papers Influence plot Curvilinear relation distribution is not Poisson POISPLOT macro: output SUGI 8 Michael Friendly title "Instances of may in Federalist papers"; data madison; input count blocks; label count= Number of Occurrences 5 blocks= Blocks of Text ; 6 datalines; ; 5 %poisplot(data=madison,count=count, freq=blocks); POISPLOT macro: example SUGI 8 7 Michael Friendly Correspondence analysis and MCA (CORRESP macro) Fit loglinear models, visualize lack-of-fit (MOSAIC macro) Test & visualize partial association (MOSAIC macro) Visualize pairwise association (MOSMAT macro) Visualize conditional association (MOSMAT macro) Visualize loglinear structure (MOSMAT macro) n-way tables Two-way tables tables Visualize odds ratio (FFOLD macro) k tables Homogeneity of association r tables Trilinear plots (TRIPLOT macro) r c tables Visualize association (SIEVE program) r c tables Visualize association (MOSAIC macro) Square r r tables Visualize agreement (AGREE program) Contingency tables SUGI 8 6 Michael Friendly 5 6 Number of Occurrences - -9 n: a/log(p) =. p: -e(b) =.69 slope(b) = -.99 intercept= Count metameter Linear relation correct distribution, slope gives parameter estimates CI reflect variability of the individual counts, nk DISTPLOT macro %distplot(data=madison, count=count, freq=blocks, dist=negbin); Other distributions: Analogous plots, for suitable count metameter, φ (nk) vs. k. Generalized robust distribution plots

8 Methods for tables Bickel et al. (975): data on admissions to graduate depatments at U. C. Berkeley in 97. Aggregate data for the six largest departments: Table : Admissions to Berkeley graduate programs Admitted Rejected Total % Admitted Males Females Total Evidence for gender bias? G () = 9.7, χ () = 9., p <. Odds ratio, θ = Odds(Admit Male) Odds(Admit =.8 Female) Males 8% more likely to be admitted. SUGI 8 8 Michael Friendly Standard analysis: PROC FREQ proc freq data=berkeley; weight freq; tables gender*admit / chisq; Output: Statistics for Table of gender by admit Statistic DF Value Prob Chi-Square 9.5 <. Likelihood Ratio Chi-Square 9.9 <. Continuity Adj. Chi-Square <. Mantel-Haenszel Chi-Square 9.89 <. Phi Coefficient.7 How to visualize and interpret? SUGI 8 9 Michael Friendly SUGI 8 Michael Friendly Only one department (A) shows association; θa =.9 women (.9) =.86 times as likely as men to be admitted Sex: Female Sex: Female Sex: Female Admit?: Yes Admit?: No Admit?: Yes Admit?: No Admit?: Yes Admit?: No Department: D Sex: Male 8 79 Department: E Sex: Male 5 8 Department: F Sex: Male 5 Sex: Female Sex: Female Sex: Female Admit?: Yes Admit?: No Admit?: Yes Admit?: No Admit?: Yes Admit?: No Department: A Sex: Male 5 Department: B Sex: Male 5 7 Department: C Sex: Male 5 Data in Table had been pooled over departments Stratified analysis: one fourfold display for each department Each table standardized to equate marginal frequencies Shading: highlight departments for which Ha : θi Fourfold displays for k tables SUGI 8 Michael Friendly Confidence rings do not overlap: θ Sex: Female Admit?: Yes Admit?: No 98 9 Quarter circles: radius nij area frequency Independence: Adjoining quadrants align Odds ratio: ratio of areas of diagonally opposite cells Confidence rings: Visual test of H : θ = adjoining rings overlap Sex: Male Fourfold displays for tables

9 What happened here? Simpson s paradox: Aggregate data are misleading because they falsely assume men and women apply equally in each field. But: Large differences in admission rates across departments. Men and women apply to these departments differentially. Women applied in large numbers to departments with low admission rates. (This ignores possibility of structural bias against women: differential funding of fields to which women are more likely to apply.) Other graphical methods can show these effects. SUGI 8 Michael Friendly The FOURFOLD program and the FFOLD macro The FOURFOLD program is written in SAS/IML. The FFOLD macro provides a simpler interface. Printed output: (a) significance tests for individual odds ratios, (b) tests of homogeneity of association (here, over departments) and (c) conditional association (controlling for department). Plot by department: berkf.sas %include catdata(berkeley); %ffold(data=berkeley, var=admit Gender, /* panel variables */ 5 by=dept, /* stratify by dept */ 6 down=, across=, /* panel arrangement */ 7 htext=); /* font size */ Aggregate data: first sum over departments, using the TABLE macro: 8 %table(data=berkeley, out=berk, 9 var=admit Gender, /* omit dept */ weight=count, /* frequency variable */ order=data); %ffold(data=berk, var=admit Gender); SUGI 8 Michael Friendly SUGI 8 5 Michael Friendly Hair Color Black Brown Red Blond Brown Eye Color Blue Hazel 5 5 Green Height/width marginal frequencies, ni+, n+j Area expected frequency, ni+n+j Shading observed frequency, nij, color: sign(nij ˆmij). Independence: Shown when density of shading is uniform. Sieve diagrams SUGI 8 Michael Friendly Black Brown Red Blond Hair Color Brown Eye Color Blue Hazel Expected frequencies: Hair Eye Color Data Green When row/col variables are independent, nij ni+n+j each cell can be represented as a rectangle, with area = height width frequency, nij count area Two-way frequency tables: Sieve diagrams

10 SUGI 8 7 Michael Friendly Left Eye Grade High Low Low Right Eye Grade High Unaided distant vision data Vision classification data for 777 women Sieve diagrams SUGI 8 6 Michael Friendly Hair Color Black Brown Red Blond Brown Eye Color Green Hazel Blue Effect ordering: Reorder rows/cols to make the pattern coherent Sieve diagrams SUGI 8 9 Michael Friendly Admitted Standardized residuals: Rejected 9 78 <- -:--:- : : > Model: (Gender)(Admit) area frequency, nij = ni+p j i Height relative proportions of other variable, p j i = nij/ni+ Width one set of marginals, ni+ Hartigan and Kleiner (98), Friendly (99, 999): Mosaic displays and Log-linear Models SUGI 8 8 Michael Friendly proc iml; %include iml(sieve); *-- frequency table; tab = { , , , }; 8 *-- variable and level names; 9 vnames = { Right Eye Grade Left Eye Grade }; lnames = { High Low, High Low }; title = { Unaided distant vision data }; *-- Global options; font= hwpsl ; 5 run sieve(tab, vnames, lnames, title ); 6 quit; sieve.sas Sieve diagrams: Example

11 SUGI 8 Michael Friendly Model [Dept] [Gender]: G (5) =.6. Note: Departments ordered A F by overall rate of admission. Model: (Dept)(Gender) Did men and women apply differentially to departments? Did departments differ in the total number of applicants? Departments Gender: Mosaic displays SUGI 8 Michael Friendly Admitted Standardized residuals: Rejected 9 78 <- -:--:- : : > Model: (Gender)(Admit) E.g., aggregate Berkeley data, independence model: Independence: Rows align, or cells are empty! Sign: negative in red; + positive in blue Magnitude: intensity of shading: dij >,,,... Shading: Sign and magnitude of Pearson χ residual, dij = (nij ˆmij)/ ˆmij (or L.R. G ) SUGI 8 Michael Friendly Admitted Rejected Model: (DeptGender)(Admit) E.g., Joint independence (null model, Admit as response) [G () = 877.]: SUGI 8 Michael Friendly DATA (size of tiles) (some) marginal frequencies (spacing visual grouping) RESIDUALS (shading) what associations have been omitted? Each mosaics shows: [AB][BC] log mijk = µ + λ A i + λ B j + λ C k + λ AB ij + λ BC jk e.g., the model for conditional independence (A C B): Saturated model [ABC] (none) All two-way associations [AB][AC][BC] (none) Mutual independence [A][B][C] A B C Joint independence [AB][C] (A B) C Conditional independence [AC][BC] (A B) C Model Model symbol Independence interpretation Table : Log-linear Models for Three-Way Tables Can fit any log-linear model (e.g., -way), Generalizes to n-way tables: divide cells recursively Mosaic displays for multiway tables

12 SUGI 8 5 Michael Friendly Enter frequency table directly in SAS/IML, or read from a SAS dataset. Most flexible: Select, collapse, reorder, re-label table levels using SAS/IML statements Specify structural s, fit specialized models (e.g., quasi-independence) Interface to models fit using PROC GENMOD SAS/IML modules: mosaics.sas program Examples: Many in VCD and on web site Documentation & software: Runs the current version of mosaics via a cgi-script Can run sample data, upload a data file, enter data in a form. Choose model fitting and display options (not all supported). Demonstration web applet: Software for Mosaic Displays SUGI 8 Michael Friendly Admitted Rejected Fits poorly, overall (G (6) =.7) But, only in Department A! E.g., Add [Dept Admit] association Conditional independence: Model: (DeptGender)(DeptAdmit) Pattern of lack-of-fit (residuals) better model smaller residuals cleaning the mosaic better model empty cells best done interactively! Visual fitting: Mosaic displays for multiway tables Software for Mosaic Displays Macro interface: mosaic macro, table macro, mosmat macro mosaic macro Easiest to use: Direct input from a SAS dataset No knowledge of SAS/IML required Reorder table variables; collapse, reorder table levels with table macro Convenient interface to partial mosaics (BY=) table macro Create frequency table from raw data Collapse, reorder table categories Re-code table categories using SAS formats, e.g., = Male = Female mosmat macro Mosaic matrices analog of scatterplot matrix (Friendly, 999) SUGI 8 6 Michael Friendly Software for Mosaic Displays Other implementations: JMP and SAS/INSIGHT both provide rudimentary mosaic displays (two-way only, no interface with model-fitting engines (shame!). The R-Project ( now provides the vcd package, implementing most of the graphical methods from VCD. Truly interactive mosaic displays have been implemented in: Lisp-Stat (ViSta) Java (Mondrian) SUGI 8 7 Michael Friendly

13 mosaic macro example: Berkeley data berkeley.sas title Berkeley Admissions data ; proc format; value admit ="Admitted" ="Rejected" ; value dept ="A" ="B" ="C" ="D" 5="E" 6="F"; 5 value $sex M = Male F = Female ; 6 data berkeley; 7 do dept = to 6; 8 do gender = M, F ; 9 do admit =, ; input output; end; end; end; /* -- Male -- - Female- */ /* Admit Rej Admit Rej */ 5 datalines; /* Dept A */ /* B */ /* C */ /* D */ /* E */ 5 7 /* F */ ; SUGI 8 8 Michael Friendly mosaic macro example: Berkeley data mosaic9m.sas goptions hsize=7in vsize=7in; %include catdata(berkeley); 5 *-- apply character formats to numeric table variables; 6 %table(data=berkeley, 7 var=admit Gender Dept, 8 weight=freq, 9 char=y, format=admit admit. gender $sex. dept dept., order=data, out=berkeley); %mosaic(data=berkeley, vorder=dept Gender Admit, /* reorder variables */ plots=:, /* which plots? */ 5 fittype=joint, /* fit joint indep. */ 6 split=h V V, htext=); /* options */ SUGI 8 9 Michael Friendly SUGI 8 5 Michael Friendly 9 title = Model: &MODEL ; plots=:; fittype= joint ; run mosaic(dim, table, vnames, lnames, plots, title); quit; 8 proc iml; *-- Berkeley Admissions data; dim = { 6}; vnames = {"Admit" "Gender" "Dept"}; /* var names */ 5 lnames = { /* level names */ 6 "Admitted" "Rejected" " " " " " " " ", 7 "Male" "Female" " " " " " " " ", 8 "A" "B" "C" "D" "E" "F"}; 9 /* Admit Not Admit Not */ table = { 5, 89 9, 5 7, 7 8, 5, 9, 8 79,, 5 8, 9 99, 5 5, 7}; 6 reset storage=mosaic.mosaic; 7 load module=_all_; *-- Load mosaic modules; mosaic9.sas SAS/IML example SUGI 8 5 Michael Friendly Two-way, Dept. by Gender Three-way, Dept. by Gender by Admit Admitted Rejected Model: (Dept)(Gender) Model: (DeptGender)(Admit) mosaic macro example: Berkeley data

14 SUGI 8 5 Michael Friendly δj=β Gender : effect of gender in Dept. A. β Dept i : effect on admissions of department, Lij = log(mij/mij): log odds of admission for males as vs. females, where, Lij = log(mij/mij) = α + β Dept i + δj=β Gender log mijk = µ + λ A i + λ D j + λ G k + λ AD ij + λ DG jk + δj=λ AG ik Admit Gender Dept, except for Dept. A Lij = log(mij/mij) = α + β Dept i log mijk = µ + λ A i + λ D j + λ G k + λ AD ij + λ DG jk Admit Gender Dept (conditional independence) For a binary response, each loglinear model is equivalent to a logit model (logistic regression, with categorical predictors) Logit models SUGI 8 5 Michael Friendly Admit Reject Dept Admit Reject Gender Admit Admit Reject Admit Reject %include catdata(berkeley); %mosmat(data=berkeley, vorder=admit Gender Dept, sort=no); mosmat macro: Mosaic matrices SUGI 8 55 Michael Friendly Department - Gender Female Male Log Odds (Admitted) - - logit(admit) = Dept DeptA*Gender proc catmod order=data data=berkeley;... response / out=predict; model admit = dept deptag / ml; %catplot(data=predict, xc=dept, class=gender, type=function, z=.96, legend=legend); Admit Gender Dept, except for Dept. A Fit: PROC CATMOD; plot: CATPLOT macro Plots for logit models SUGI 8 5 Michael Friendly Visualization procedures CATPLOT macro - plot predicted, observed log odds from CATMOD INFLGLIM macro - influence plots for generalized linear models HALFNORM macro - half-normal plot of residuals for generalized linear models Fitting procedures PROC CATMOD PROC LOGISTIC PROC GENMOD / dist=poisson SAS/INSIGHT (Fit Y X) Options Distribution poisson Logit models

15 Plots for logit models Fit: PROC CATMOD; plot: CATPLOT macro Admit Gender Dept, except for Dept. A catberk6.sas %include catdata(berkeley); data berkeley; set berkeley; *-- Dummy variable for Gender in Dept A; 5 deptag = (gender= F ) * (dept=); 6 format dept dept.; 7 8 proc catmod order=data 9 data=berkeley; weight freq; population dept gender; direct deptag; response / out=predict; model admit = dept deptag / ml; 5 run; 6... SUGI 8 56 Michael Friendly Plots for logit models PROC CATMOD output: Maximum Likelihood Analysis of Variance Source DF Chi-Square Pr > ChiSq Intercept 9. <. dept <. deptag 6. <. Likelihood Ratio Analysis of Maximum Likelihood Estimates Standard Chi- Parameter Estimate Error Square Pr > ChiSq Intercept <. dept A <. B <. C D E <. deptag <. How to interpret? SUGI 8 57 Michael Friendly SUGI 8 59 Michael Friendly Department - Gender Female Male Log Odds (Admitted) - - logit(admit) = Dept DeptA*Gender catberk6.sas title logit(admit) = Dept DeptA*Gender ; %catplot(data=predict, x=dept, class=gender, type=function, /* plot the log odds */ 5 z=.96); /* 95% error bars */ Plots for logit models SUGI 8 58 Michael Friendly A M A F B M B F C M C F D M D F E M E F F M F F dept gender _OBS PRED SEPRED_ PROC CATMOD: observed and predicted logits: catberk6.sas 7 proc print data=predict; 8 id dept gender; 9 var _obs pred sepred_; format _numeric_ 6. dept dept.; where(_type_= FUNCTION ); Plots for logit models

16 SUGI 8 6 Michael Friendly 9 %inflglim(data=berkeley, class=dept gender admit, resp=freq, model=admit dept gender dept, dist=poisson, id=cell, 5 gx=hat, gy=streschi); 8 genberk.sas %include catdata(berkeley); *-- make a cell ID variable, joining factors; data berkeley; set berkeley; 5 cell = trim(put(dept,dept.)) 6 gender 7 trim(put(admit,yn.)); Berkeley data, model [AD][GD] Lij = α + β Dept i INFLGLIM macro: Example SUGI 8 6 Michael Friendly Leverage (H value) - AM+ - Adjusted Pearson residual - - FM- BM- CF+ CM- EF+ BM+ FF- AM- AF- CF- DM- DF- EM- EF- AF+ 5 Fit: PROC GENMOD; calculates additional diagnostic measures (Hat value, Cook s D, etc.) Plot: measures of residual (GY= χ, χ residual) vs. leverage (GX=hat value), bubble size (area, radius) Cook s D. which cells have undue impact on fitted model? INFLGLIM macro: Influence plots for generalized linear models (Williams, 987) Diagnostic plots for Generalized Linear Models SUGI 8 6 Michael Friendly %halfnorm(data=berkeley, class=dept gender admit, resp=freq, model=dept gender dept admit, 5 dist=poisson, id=cell); genberk.sas Plot ordered absolute residuals, r (i) vs. expected normal values, z (i) Standard normal confidence envelope not suitable for GLMs Simulate reference line and envelope with simulated confidence intervals HALFNORM macro: Half-normal plot of residuals (Atkinson, 98) Diagnostic plots for Generalized Linear Models SUGI 8 6 Michael Friendly All cells which do not fit ( ri > ) are for department A. Males applying to dept A have large leverage large influence (Cook s D) Leverage (H value) AM+ - Adjusted Pearson residual - - CF- EF- DM- FM- BM- DF- CF+ CM- EF+ EM- BM+ FF- AM- AF- AF+ 5 INFLGLIM macro: Example

17 SUGI 8 65 Michael Friendly Quantitative regressors: age, dose Transformed regressors: age, log(dose) Polynomial regressors: age, age, Categorical predictors: treatment, sex Interaction regessors: treatment age, sex age Explanatory variables: Binary response: success/failure, vote: yes/no Ordinal response: none, some, severe depression Polytomous response: vote Liberal, Tory, Aliance, NDP Response variable: Logistic regression models SUGI 8 6 Michael Friendly Points with largest residual labeled The model fits well, except in department A. Expected value of half normal quantile Absolute Std Deviance Residual AF+ AM-AM+ EF+ AF- 5 Logistic regression models: Binary response For a binary response, Y (, ), let x be a vector of p regressors, and πi be the probability, Pr(Y = x). The logistic regression model is a linear model for the log odds, or logit that Y =, given the values in x, ( ) πi logit(πi) log = α + x T i β πi = α + βxi + βxi + + βpxip An equivalent (non-linear) form of the model may be specified for the probability, πi, itself, πi = { + exp( [α + x T i β])} The logistic model is a linear model for the log odds, but also a multiplicative model for the odds of success, πi = exp(α + x T i β) = exp(α) exp(x T i β) πi so, increasing xij by increases logit(πi) by βj, and multiplies the odds by e βj. SUGI 8 66 Michael Friendly Logistic regression models: Binary response Fitting: PROC LOGISTIC (or ROBUST macro M-estimation) Data: Frequency form (from PROC FREQ) when all predictors are discrete Case form when any predictors are quantitative Models: CLASS statement (V7+) no need for dummy variables discrete predictors can specify order and parameterization (effect, polynomial, reference cell) MODEL statement allows GLM syntax, e.g., model Better = Sex Treat SUGI 8 67 Michael Friendly

18 Logistic regression models: Binary response Visualization: Goal: see and understand the data and fitted model LOGODDS macro: Plot observed responses, fitted and smoothed probabilities Model plots: OUTPUT statement fitted ˆπi, lower/upper ( α) CI, and/or fitted logit, (α + x T i ˆβ) ± z α/ se(logit) Plot with standard procedures (PROC GCHART, GPLOT) Utility macros (BARS, LABEL, POINTS, PSCALE, etc.) for custom displays Effect plots plot hierarchical subset of effects, averaging over those not included. INFLOGIS macro: Influence plots for logistic regression models ADDVAR macro: Added variable plots for new predictors or transformations of old SUGI 8 68 Michael Friendly Example: Arthritis treatment data Predictors: Sex, Treatment (treated, placebo), Age Response: improvement (none, some, marked) Consider first as binary response: None vs. (Some or Marked)= Better Data in case form: arthrit.sas data arthrit; length treat $7. sex $6. ; input id treat $ sex $ age ; case = _n_; 5 better = (improve > ); *-- Make binary response; 6 datalines ; 7 57 Treated Male 7 9 Placebo Male Treated Male 9 Placebo Male 9 77 Treated Male 7 Placebo Male 5... (observations omitted ) 56 Treated Female 69 Placebo Female 66 Treated Female 7 5 Placebo Female 66 7 Placebo Female 68 Placebo Female 7 5 ; SUGI 8 69 Michael Friendly SUGI 8 7 Michael Friendly AGE Log Odds Better= - SUGI 8 7 Michael Friendly %include catdata(arthrit); %logodds(data=arthrit, x=age, y=better, /* vars to plot */ smooth=.5, /* LOWESS smoothing parameter */ 5 plot=logit); /* plot on logit scale */ Show the data: Plot (/) responses [stacked or jittered] ( ) yi+/ Divide X into groups (e.g., deciles), emprical logit, log ni yi+/, for each Linear logistic regression, plus smoothed curve (LOWESS macro) The LOGODDS macro: Smoothing: Discrete data often requires smoothing to see! Linearity: Is a linear relation realistic? LOGODDS macro: Empirical logit plots

19 PROC LOGISTIC: Model fitting and plotting Specify ordering of response levels (order= or descending options) Specify parameterizations for CLASS variables OUTPUT statement to get fitted logits and probabilities glogistc.sas proc logistic data=arthrit descending; class sex (ref=last) treat (ref=first) / param=ref; model better = sex treat age; 5 output out=results 6 p=prob l=lower u=upper 7 xbeta=logit stdxbeta=selogit / alpha=.; The output includes: Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq sex treat age SUGI 8 7 Michael Friendly Analysis of Maximum Likelihood Estimates Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept sex Female treat Treated age Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits sex Female vs Male.7.8. treat Treated vs Placebo age Parameter estimates (reference cell coding): β =.9 Females e.9 =. times more likely to be better than Males β =.76 Treated e.76 =5.8 times more likely to be better than Placebo β =.87 odds ratio=.5 odds of improvement increase 5% each year. Over years, odds of improvement multiplied by e.86 =.6, a 6% increase. SUGI 8 7 Michael Friendly PROC LOGISTIC: Model plots Plots of fitted values from the dataset specified on the OUTPUT statement Plot either predicted probabilities or logits The first few observations from the results dataset: id sex treat age better prob lower upper logit selogit 57 Male Treated Male Placebo Male Treated Male Placebo Male Treated Male Placebo SUGI 8 7 Michael Friendly PROC LOGISTIC: Model plots Basic plots: Plot either logit or probability vs. one predictor (continuous or most levels) Separate curves for one factor Separate panels for all others (BY statement) proc gplot data=results; plot (logit prob) * age = treat; by sex; symbol v=circle i=join l= c=black; symbol v=dot i=join l= c=red; SUGI 8 75 Michael Friendly

20 SUGI 8 77 Michael Friendly 9 *-- Error bars, on logit scale; %bars(data=results, var=logit, class=age, cvar=treat, by=age, barlen=selogit, out=bars); *-- Custom legends and panel labels; 5 %label(data=results, y=logit, x=age, xoff=, cvar=treat, 6 by=sex, subset=last.treat, out=label, pos=6, text=treat); 7 %label(data=results, y=.5, x=, size=, 8 by=sex, subset=first.sex, out=label, pos=6, text=sex); 9 *-- Probability scales at right; %pscale(out=pscale, byvar=sex, byval=%str( Female, Male )); *-- Join ANNOTATE datasets; 5 data bars; 6 set label label bars pscale; 7 proc sort; 8 by sex; glogistc.sas Enhanced plots: PROC LOGISTIC: Model plots SUGI 8 76 Michael Friendly Age Age Placebo. Log Odds Improved Placebo.6 Probability Improved Log Odds Improved Treated Probability Improved Female Male Treated Plot on logit scale, with probability scale at right (PSCALE macro) Show 67% error bars ± se (BARS macro) Custom legend and panel labels (LABEL macro) Enhanced plots: PROC LOGISTIC: Model plots SUGI 8 79 Michael Friendly Age Age Treated Placebo. - Log Odds Improved Probability Improved Probability Improved.7 Log Odds Improved.7 Placebo.8.8 Female.9 Treated Male proc logistic data=arthrit descending; class sex (ref=last) treat (ref=first) / param=ref; model better = sex treat output out=results p=prob l=lower u=upper 5 xbeta=logit stdxbeta=selogit / alpha=.; Only need to change the MODEL statement Output dataset automatically incorporates all model terms Plotting steps remain exactly the same Plotting fitted values Models with interactions SUGI 8 78 Michael Friendly Age Age Placebo. - Log Odds Improved Placebo Probability Improved - Treated.7 Log Odds Improved Probability Improved Female Treated.9 Male title h=.8 a=-9 Probability Improved /* right axis label */ h=.5 a=-9 ; /* extra space */ goptions hby=; /* suppress BY values */ proc gplot data=results; 5 plot logit * age = treat / 6 vaxis=axis haxis=axis hm= vm= 7 nolegend anno=bars frame; 8 by sex; 9 axis label=(a=9 Log Odds Improved ) order=(- to ); axis order=( to 8 by ) offset=(,6); symbol v=+ i=join l= c=black; symbol v=- i=join l= c=red; label age= Age ; 5 run; glogistc.sas

21 Effect plots for generalized linear models Fox (987) For complex models, often wish to plot a specific main effect or interaction (including lower-order relatives) Fit full model to data with linear predictor (e.g., logit) η = Xβ and link function g(µ) = η estimate b of β and covariance matrix V (b) of b. Vary each predictor in the term over its range Fix other predictors at typical values (mean, median, proportion in the data) effect model matrix, X Calculate fitted effect values, ˆη = X b. Standard errors are square roots of diag(x V (b)x T ) Plot ˆη, or values transformed back to scale of response, g ( ˆη ). Note: This provides a general means to visualize interactions in all linear and generalized linear models. SUGI 8 8 Michael Friendly Effect plots in SAS Create a grid of values for predictors in the effect (EXPGRID macro) Fix other predictors at typical values (mean, median, proportion in the data) Concatenate grid with data Fit model output data set fitted values in the grid Standard errors automatically calculated Plot fitted values in the grid (Not yet a macro) SUGI 8 8 Michael Friendly SUGI 8 8 Michael Friendly 7 *-- select grid, replace labels; 8 data effect; 9 set predicted (where =(_in_=)); label logit= log odds of Volunteering prob = Probability of Volunteering ; 6 *-- fit model, output fitted values; proc logistic data=cowles outest=parm covout; model Volunter = Sex Extraver Neurot / covb; output out=predicted xbeta=logit stdxbeta=selogit 5 p=prob u=upper l=lower / alpha=.; 7 *-- catenate grid values with data; 8 data cowles; 9 set cowles _grid_; 6 %include catdata(cowles); %expgrid(sex=.5, /* Fix Sex at mid value */ Extraver= to by, /* range of predictors */ Neurot= to by 6, /* " " */ 5 _in_=); /* grid select value */ cowles.sas Effect plots: Example SUGI 8 8 Michael Friendly Extraversion Probability of Volunteering Neuroticism 6 8 Predictors: Sex, Neuroticism, Extraversion strong interaction, Neuroticism Extraversion Cowles and Davis (987) Volunteering for a psychology experiment Effect plots: Example

22 6 SUGI 8 85 Michael Friendly C, CBAR analogs of Cook s D in OLS standardized change in regression coefficients when i-th case is deleted. DIFCHISQ, DIFDEV χ when i-th case is deleted. Influence: Actual impact of an individual case leverage residual Residuals: Which observations are poorly fitted? Leverage: Potential impact of an individual case distance from the centroid in space of predictors Influence measures and diagnostic plots SUGI 8 8 Michael Friendly 6 8 Extraversion.. Probability of Volunteering Neuroticism.9 *-- Custom legend; %label(data=effect, 5 x=extraver, y=prob, 6 subset=extraver=, /* at last.extraver */ 7 text=put(neurot,.), /* label text */ 8 pos=6, xoff=., out=labels); 9 *-- Plot step; proc gplot data=effect; plot prob * Extraver = Neurot / vaxis=axis haxis=axis vm= anno=labels nolegend; 5 symbol v=dot i=spline r=5; 6 axis label=(a=9 r=) order=( to.9 by.); 7 axis order=( to by 6) offset=(,); 8 run; quit; cowles.sas Effect plots: Example Influence measures and diagnostic plots PROC LOGISTIC provides printed output with the influence and iplots options proc logistic data=arthrit ; model better = sex treat age / influence iplots; The LOGISTIC Procedure Regression Diagnostics Deviance Residual Hat Matrix Diagonal Case ( unit =.6) ( unit =.) Number Value Value *.89 *.6 *. *.685 *.87 *.5 *. * 5.7 *.86 * 6.88 *.8 * 7.7 *.8 * 8.99 *.9 * 9.96 *.66 *.5 *. *. *.6 *.5 *. *. *.65 *.599 *.5 * 5. *.65 * *.5 * SUGI 8 86 Michael Friendly 7.9 *.69 * 8.6 *.58 * 9.9 *.69 *.6 *.58 *. *.7 *.8 *.6 *. *.7 *.59 *.6 * 5.9 *.78 * 6.69 *.5 * 7.99 *.8 *... Problems: Way too much output Doesn t highlight unusual cases well Index plots don t consider combinations of measures SUGI 8 87 Michael Friendly

23 SUGI 8 89 Michael Friendly Male Treated Male Placebo Female Placebo Female Placebo Female Treated Female Treated case better sex treat age pred hat difchisq difdev c Printed output lists cases with large leverage, residual or influence: %include data(arthrit); %inflogis(data=arthrit, class=sex treat, /* CLASS variables */ y=better, /* response */ 5 x=sex treat age, /* predictors */ 6 id=case, /* case ID */ 7 gy=difchisq, /* graph ordinate */ 8 gx=pred HAT); /* graph abscissas */ INFLOGIS macro: Example SUGI 8 88 Michael Friendly Estimated Probability Leverage (Hat value) Change in Pearson Chi Square Change in Pearson Chi Square Arthritis treatment data Bubble size: Influence on Coefficients (C) Arthritis treatment data Bubble size: Influence on Coefficients (C) Specialized version of INFLGLIM macro for logistic regression Plots a measure of change in χ (DIFCHISQ or DIFDEV) vs. predicted probability or leverage. Bubble symbols show actual influence (C or CBAR) Shows standard cutoffs for large values Labels outlying cases INFLOGIS macro SUGI 8 9 Michael Friendly Leverage (Hat value) Change in Pearson Chi Square Arthritis treatment data Bubble size: Influence on Coefficients (C) INFLOGIS macro: Example SUGI 8 9 Michael Friendly Estimated Probability Change in Pearson Chi Square Arthritis treatment data Bubble size: Influence on Coefficients (C) INFLOGIS macro: Example

24 Conclusions Summarization & exposure Effective data analysis requires summarization hypothesis tests, model fits (& comparisons!), parameter estimates (& precision!) Also requires exposure displays to help the viewer see (& understand!) patterns, trends, and anomalies. Graphical methods for categorical data Many new methods developed over the last 5 years Some novel, others extend familiar methods for quantitative data Described and illustrated in VCD Theory into practice To be useful, statistical methods must be: available implemented in standard software accessible easy to use (or at least easier) VCD provides general macros and SAS/IML programs SUGI 8 9 Michael Friendly References Atkinson, A. C. Two graphical displays for outlying and influential observations in regression. Biometrika, 68:, 98. Bickel, P. J., Hammel, J. W., and O Connell, J. W. Sex bias in graduate admissions: Data from Berkeley. Science, 87:98, 975. Cowles, M. and Davis, C. The subject matter of psychology: Volunteers. British Journal of Social Psychology, 6:97, 987. Fox, J. Effect displays for generalized linear models. In Clogg, C. C., editor, Sociological Methodology, 987, pp Jossey-Bass, San Francisco, 987. Friendly, M. Mosaic displays for multi-way contingency tables. Journal of the American Statistical Association, 89:9, 99. Friendly, M. Extending mosaic displays: Marginal, conditional, and partial views of categorical data. Journal of Computational and Graphical Statistics, 8():7 95, 999. Friendly, M. Multidimensional arrays in SAS/IML. In Proceedings of the SAS User s Group International Conference, volume 5, pp. 7. SAS Institute,. Friendly, M. Corrgrams: Exploratory displays for correlation matrices. The American Statistician, 56():6,. Friendly, M. and Kwan, E. Effect ordering for data displays. Computational Statistics and Data Analysis, 7,. In press. Hartigan, J. A. and Kleiner, B. Mosaics for contingency tables. In Eddy, W. F., editor, Computer Science and Statistics: Proceedings of the th Symposium on the Interface, pp Springer-Verlag, New York, NY, 98. SUGI 8 9 Michael Friendly Hoaglin, D. C. and Tukey, J. W. Checking the shape of discrete distributions. In Hoaglin, D. C., Mosteller, F., and Tukey, J. W., editors, Exploring Data Tables, Trends and Shapes, chapter 9. John Wiley and Sons, New York, 985. Ord, J. K. Graphical methods for a class of discrete distributions. Journal of the Royal Statistical Society, Series A, : 8, 967. Tufte, E. R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 98. Tukey, J. W. Some graphic and semigraphic displays. In Bancroft, T. A., editor, Statistical Papers in Honor of George W. Snedecor. Iowa State University Press, Ames, IA, 97. Tukey, J. W. Exploratory Data Analysis. Addison Wesley, Reading, MA, 977. Williams, D. A. Generalized linear model diagnostics using the deviance and single case deletions. Applied Statistics, 6:8 9, 987. SUGI 8 9 Michael Friendly

Michael Friendly. Categorical Data Analysis with Graphics

Michael Friendly. Categorical Data Analysis with Graphics -5 0 5 Sqrt(frequency) 10 15 20 25 30 35 40 April, 2003 SUGI 28 York University, Toronto, Michael Friendly Black Brown Red Blond 0 2 4 6 8 10 12 Number of males Left Eye Grade High