Applied Multivariate Analysis

Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017

Choosing a Statistical Method

1 Choosing an appropriate method
2 Cross-tabulation
    More advanced analysis of frequency tables (log-linear models)
3 Regression

Scales of Measurement

Measurement is a process by which numbers or symbols are attached to given characteristics of an object according to predetermined rules. Main scales:

  Nominal: classification (similar, different)
  Ordinal: in addition to the nominal scale, ordering
  Interval: in addition to the previous, differences between two measurements are meaningful, but there is no fixed origin (zero point)
  Ratio: in addition to the previous, a fixed origin (zero point)

Dependent-Independent Variables: Statistical Methods

Depending on the measurement scale of the variables, multivariate methods can be chosen (roughly) according to the following table for a dependent-independent variable analysis:

                    Dependent variable(s)
                    One                                        More than one
Indep. vars         Metric             Nonmetric               Metric                   Nonmetric
----------------------------------------------------------------------------------------------------------
One    Metric       Regression         Discriminant            Canonical                Multiple DA (MDA)
                    analysis (RA)      analysis (DA),          correlation
                                       logistic regression
       Nonmetric    t-test             Discrete DA (DDA)       MANOVA                   Multiple DDA (MDDA)
More   Metric       Multiple RA        DA                      Canonical correlation,   MDA
                                                               structural equations
       Nonmetric    ANOVA              DDA,                    MANOVA                   MDDA
                                       conjoint analysis

Analysis of Interdependencies

The most common methods for analyzing interdependencies (without causal relationships) are:

                     Type of data
No. of variables     Metric                          Nonmetric
--------------------------------------------------------------------------------
Two                  Simple correlation              Two-way contingency tables
                     Cluster analysis                Loglinear models
More                 Principal component analysis    Multiway contingency tables
                     Factor analysis                 Loglinear models
                     Cluster analysis                Correspondence analysis


Analysis of frequency tables

Dependencies between two classification variables can be analyzed using cross-tabulation.

                     X
Y        1       2       ...     c       Sum
1        f_11    f_12    ...     f_1c    f_1.
2        f_21    f_22    ...     f_2c    f_2.
...      ...     ...             ...     ...
r        f_r1    f_r2    ...     f_rc    f_r.
Sum      f_.1    f_.2    ...     f_.c    n

Here $f_{ij}$ is the number of observations in class $i$ of $Y$ and class $j$ of $X$, $f_{.j} = \sum_{i=1}^{r} f_{ij}$ and $f_{i.} = \sum_{j=1}^{c} f_{ij}$ are the marginal totals, and $n = \sum_{i=1}^{r} \sum_{j=1}^{c} f_{ij}$ is the total number of observations.

Example 1 (Source: Base SAS Procedures Guide 9.2, Example 3.1)

The eye and hair color of children from two different regions of Europe are recorded in the data set Color. Instead of recording one observation per child, the data are recorded as cell counts, where the variable Count contains the number of children exhibiting each of the 15 eye and hair color combinations. The data set does not include missing combinations.

    data color;
      input region eyes $ hair $ count @@; /* @@ allows reading several obs per line */
      label eyes   = 'Eye Color'
            hair   = 'Hair Color'
            region = 'Geographic Region';
      datalines;
    1 blue  fair   23  1 blue  red     7  1 blue  medium 24
    1 blue  dark   11  1 green fair   19  1 green red     7
    1 green medium 18  1 green dark   14  1 brown fair   34
    1 brown red     5  1 brown medium 41  1 brown dark   40
    1 brown black   3  2 blue  fair   46  2 blue  red    21
    2 blue  medium 44  2 blue  dark   40  2 blue  black   6
    2 green fair   50  2 green red    31  2 green medium 37
    2 green dark   23  2 brown fair   56  2 brown red    42
    2 brown medium 53  2 brown dark   54  2 brown black  13
    ;

SAS commands

The data can be presented in contingency tables. Two-way tables:

    /* dependency of region and eye color */
    proc freq data = color;
      title "Region and Eye Color of European Children";
      tables region*eyes / chisq norow nocol nopercent;
      weight count; /* Note: important to weight by count due to the format of the data */
    run;
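To make the role of the weight statement concrete, here is a minimal Python sketch (not part of the course material) that sums the cell counts of the Color data into the region by eye color table. The function name `weighted_crosstab` and the tuple layout are assumptions made for this illustration only.

```python
from collections import defaultdict

# Cell counts copied from the Color data set: (region, eyes, hair, count).
CELLS = [
    (1, "blue", "fair", 23), (1, "blue", "red", 7), (1, "blue", "medium", 24),
    (1, "blue", "dark", 11), (1, "green", "fair", 19), (1, "green", "red", 7),
    (1, "green", "medium", 18), (1, "green", "dark", 14), (1, "brown", "fair", 34),
    (1, "brown", "red", 5), (1, "brown", "medium", 41), (1, "brown", "dark", 40),
    (1, "brown", "black", 3), (2, "blue", "fair", 46), (2, "blue", "red", 21),
    (2, "blue", "medium", 44), (2, "blue", "dark", 40), (2, "blue", "black", 6),
    (2, "green", "fair", 50), (2, "green", "red", 31), (2, "green", "medium", 37),
    (2, "green", "dark", 23), (2, "brown", "fair", 56), (2, "brown", "red", 42),
    (2, "brown", "medium", 53), (2, "brown", "dark", 54), (2, "brown", "black", 13),
]


def weighted_crosstab(cells, row_key, col_key):
    """Sum the count of every record into a {(row, column): frequency} table."""
    table = defaultdict(int)
    for region, eyes, hair, count in cells:
        record = {"region": region, "eyes": eyes, "hair": hair}
        # Each record stands for `count` children, so add the count, not 1.
        table[(record[row_key], record[col_key])] += count
    return dict(table)


region_by_eyes = weighted_crosstab(CELLS, "region", "eyes")
for region in (1, 2):
    row = [region_by_eyes[(region, eyes)] for eyes in ("blue", "brown", "green")]
    print(region, row, sum(row))
# prints:
# 1 [65, 123, 58] 246
# 2 [157, 218, 141] 516
```

Adding 1 per record instead of the count would tabulate the 27 data lines rather than the 762 children, which is why the weighting matters for data stored in this form.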

SAS results

OUTPUT:

    Region and Eye Color of European Children

                         Eye Color
    Geographic
    Region          blue    brown    green    Total
    ------------------------------------------------
    1                 65      123       58      246
                   26.42    50.00    23.58      100
    2                157      218      141      516
                   30.43    42.25    27.33      100
    ------------------------------------------------
    Total            222      341      199      762

    Statistics for Table of region by eyes

    Statistic                      DF     Value     Prob
    ----------------------------------------------------
    Chi-Square                      2    4.0496   0.1320   <- Chi-square for independence (not significant)
    Likelihood Ratio Chi-Square     2    4.0397   0.1327
    Mantel-Haenszel Chi-Square      1    0.0020   0.9646
    ----------------------------------------------------
    Phi Coefficient                      0.0729
    Contingency Coefficient              0.0727
    Cramer's V                           0.0729

    Sample Size = 762

Example: Frequency tables

The chi-square value of 4.0496 with 2 degrees of freedom corresponds to a p-value of 0.1320, which indicates that there is no convincing empirical evidence of dependence between region and eye color.

A similar analysis for hair color (see the SAS example on the web page) shows that hair color is related to region, $\chi^2(4) = 20.5$, with p-value 0.0004. The percentage distributions of the SAS output show that medium hair color is more frequent in region 1, while red is more frequent in region 2.

A similar analysis of hair color and eye color suggests dependence between them, $\chi^2(8) = 20.9$, $p = 0.0073$. The major source of the dependence seems to be that brown eyes are more often associated with dark hair than the other eye colors are.
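As a cross-check of the region by eye color result, the Pearson chi-square statistic can be computed directly from the observed counts. The following Python sketch is an illustration only; the function name `pearson_chi_square` is an assumption, and the p-value uses the closed form of the chi-square upper tail that holds only for exactly 2 degrees of freedom.

```python
import math

# Observed counts from the region by eye color table (columns: blue, brown, green).
observed = [
    [65, 123, 58],    # region 1
    [157, 218, 141],  # region 2
]


def pearson_chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (f_ij - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


chi2, df = pearson_chi_square(observed)
assert df == 2  # the closed form below is valid only for 2 degrees of freedom
p_value = math.exp(-chi2 / 2)  # P(X > chi2) for a chi-square variable with 2 df
cramers_v = math.sqrt(chi2 / (762 * min(2 - 1, 3 - 1)))
print(f"chi2 = {chi2:.4f}, df = {df}, p = {p_value:.4f}, V = {cramers_v:.4f}")
```

The printed values should agree with the Chi-Square, Prob, and Cramer's V entries of the SAS output for this table. For other degrees of freedom a general chi-square distribution function would be needed, which is not shown here.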

Example: Frequency tables

Finally, we can analyze the relation between eye and hair color separately in the different regions by running three-way tables.

    /* 3-way contingency table */
    /* the first variable becomes a kind of control variable */
    proc freq data = color;
      title "Three-way table of Region, Eye color, and Hair color";
      tables region*hair*eyes / chisq norow nocol nopercent; /* Note: tables are formed by region */
      weight count;
    run;

The results show that the dependence is present mainly in region 2.

More advanced analysis of frequency tables (log-linear models)

Intuition of log-linear models

Consider two categorical variables $A$ and $B$, and let $f_{ij}$ be the number of observations (frequency) in category $i$ of $A$ and category $j$ of $B$. Let $\eta_{ij} = E[f_{ij}]$ denote the expected value of $f_{ij}$, i.e., the expected number of observations, out of $n$, falling into class $i$ of $A$ and class $j$ of $B$. Further, let $f_{i.} = \sum_{j=1}^{b} f_{ij}$ and $f_{.j} = \sum_{i=1}^{a} f_{ij}$ denote the marginal totals (frequencies), with expected values $\eta_{i.}$ and $\eta_{.j}$, where $a$ is the number of $A$ categories and $b$ the number of $B$ categories.

Log-linear models

Writing $\eta_{ij} = \eta_{i.}\,\eta_{.j}\,\dfrac{\eta_{ij}}{\eta_{i.}\,\eta_{.j}}$ and taking logarithms, we get

$$\log(\eta_{ij}) = \lambda + \lambda_i + \lambda_j + \lambda_{ij}, \qquad (1)$$

where $\lambda$ is related to the average frequency, $\lambda_i$ indicates the marginal contribution of $A$ and $\lambda_j$ the marginal contribution of $B$ to the expected frequency, and $\lambda_{ij}$ indicates the joint effect of $A$ and $B$. In this simple case, if $\lambda_{ij} = 0$ for all $i$ and $j$, then the variables $A$ and $B$ are independent.
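The decomposition in (1) can be illustrated numerically. The Python sketch below assumes sum-to-zero constraints on the λ terms, which is one common way to make them unique but is not stated in the text; the function name `loglinear_terms` is likewise hypothetical. It shows that the joint terms vanish when the expected frequencies are built under independence, using the marginal totals of the region by eye color table.

```python
import math


def loglinear_terms(eta):
    """Split log(eta[i][j]) into lam + lam_a[i] + lam_b[j] + lam_ab[i][j].

    Assumption: all lam_a, lam_b, and lam_ab terms sum to zero over each index.
    """
    a, b = len(eta), len(eta[0])
    logs = [[math.log(v) for v in row] for row in eta]
    lam = sum(sum(row) for row in logs) / (a * b)                      # grand mean of the logs
    lam_a = [sum(logs[i]) / b - lam for i in range(a)]                 # row effects (A)
    lam_b = [sum(logs[i][j] for i in range(a)) / a - lam for j in range(b)]  # column effects (B)
    lam_ab = [
        [logs[i][j] - lam - lam_a[i] - lam_b[j] for j in range(b)]
        for i in range(a)
    ]                                                                  # joint effects (A and B)
    return lam, lam_a, lam_b, lam_ab


# Expected frequencies under independence: eta_ij = eta_i. * eta_.j / n.
row_totals, col_totals = [246, 516], [222, 341, 199]
n = sum(row_totals)
eta_independent = [[r * c / n for c in col_totals] for r in row_totals]

# Observed region by eye color counts, for comparison.
observed = [[65, 123, 58], [157, 218, 141]]

max_joint_independent = max(
    abs(v) for row in loglinear_terms(eta_independent)[3] for v in row
)
max_joint_observed = max(abs(v) for row in loglinear_terms(observed)[3] for v in row)
print(max_joint_independent)  # zero up to floating-point rounding
print(max_joint_observed)     # nonzero: the observed table is not exactly independent
```

Applying the decomposition to observed counts corresponds to the saturated model, where the joint terms absorb whatever the marginal terms cannot explain.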

The above generalizes to higher-order tables. Consider three variables $A$, $B$, and $C$, with $f_{ijk}$ denoting the number of observations in cell $(i, j, k)$, i.e., in the intersection of class $i$ of $A$, class $j$ of $B$, and class $k$ of $C$. We are interested in the following models:

Model            Dependence structure
-------------------------------------------------------------
A + B + C        Independence model (only marginal effects)
AB + C           A and B dependent, C independent
AC + B           A and C dependent, B independent
A + BC           A independent, B and C dependent
AB + AC          A and B dependent, A and C dependent
AB + BC          A and B dependent, B and C dependent
AC + BC          A and C dependent, B and C dependent
AB + AC + BC     All pairwise dependencies
ABC              Saturated model
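A minimal Python sketch of the first model in the list, the independence model A + B + C, using the Color cell counts from Example 1. The mapping A = region, B = eye color, C = hair color is an assumption made here for illustration, and combinations that do not appear in the data are treated as zero counts. Under complete independence the expected frequencies have the closed form $f_{i..}\,f_{.j.}\,f_{..k} / n^2$; only this simplest model is sketched, and the result is not meant to replace the SAS CATMOD analysis.

```python
import math
from collections import defaultdict
from itertools import product

# Cell counts copied from the Color data set: (region, eyes, hair, count).
CELLS = [
    (1, "blue", "fair", 23), (1, "blue", "red", 7), (1, "blue", "medium", 24),
    (1, "blue", "dark", 11), (1, "green", "fair", 19), (1, "green", "red", 7),
    (1, "green", "medium", 18), (1, "green", "dark", 14), (1, "brown", "fair", 34),
    (1, "brown", "red", 5), (1, "brown", "medium", 41), (1, "brown", "dark", 40),
    (1, "brown", "black", 3), (2, "blue", "fair", 46), (2, "blue", "red", 21),
    (2, "blue", "medium", 44), (2, "blue", "dark", 40), (2, "blue", "black", 6),
    (2, "green", "fair", 50), (2, "green", "red", 31), (2, "green", "medium", 37),
    (2, "green", "dark", 23), (2, "brown", "fair", 56), (2, "brown", "red", 42),
    (2, "brown", "medium", 53), (2, "brown", "dark", 54), (2, "brown", "black", 13),
]

counts = {(region, eyes, hair): count for region, eyes, hair, count in CELLS}
n = sum(counts.values())


def margin(position):
    """One-way marginal totals for the variable at the given key position."""
    totals = defaultdict(int)
    for key, f in counts.items():
        totals[key[position]] += f
    return dict(totals)


f_a, f_b, f_c = margin(0), margin(1), margin(2)  # region, eyes, hair

# Expected frequencies under A + B + C for every cell, including empty ones.
expected = {
    (i, j, k): f_a[i] * f_b[j] * f_c[k] / n**2
    for i, j, k in product(f_a, f_b, f_c)
}

# Likelihood ratio statistic; cells with zero observed count contribute nothing.
g2 = 2 * sum(f * math.log(f / expected[key]) for key, f in counts.items() if f > 0)
print(f"n = {n}, cells = {len(expected)}, G2 = {g2:.2f}")
```

A large value of the statistic relative to its reference distribution would indicate that marginal effects alone do not describe the table, which motivates the richer models in the list.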

Example: SAS CATMOD for log-linear models

See the SAS example on the course web page for an analysis, with empirical data, of the dependencies in the above table. As homework, also work through the example on the course web page.


Regression

The basic regression model is of the form

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + u_i, \qquad (2)$$

where $y_i$ is the dependent variable, $x_{ij}$ are the explanatory variables, and $u_i$ is the error term, assumed independently and identically distributed (iid), $i = 1, \ldots, n$ (sample size), $j = 1, \ldots, p$.

The slope coefficient $\beta_j$ indicates the marginal effect of variable $x_j$ on the dependent variable, given that the other x-variables do not change (ceteris paribus condition). That is, if $x_j$ changes by one unit, $y$ is expected to change by $\beta_j$ units. Note that when transformations are applied to the variables, the interpretation of the slope coefficients must be adapted accordingly.
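As an illustration of how a transformation changes the interpretation, consider a model whose dependent variable is log(y). A slope coefficient then describes an approximate proportional change in y rather than a change in units. The Python sketch below converts a coefficient into the exact implied percentage change; the function name `percent_effect` and the coefficient value are hypothetical and are not estimates from any course data.

```python
import math


def percent_effect(beta, delta_x=1.0):
    """Exact percentage change in y implied by a change delta_x in x
    when the dependent variable of the regression is log(y)."""
    return 100.0 * (math.exp(beta * delta_x) - 1.0)


beta_hypothetical = 0.08  # illustrative value only, not an estimate
print(f"{percent_effect(beta_hypothetical):.2f}%")  # → 8.33%
```

For small coefficients the exact figure is close to 100 times the coefficient (here 8%), which is why slopes in a log model are often read directly as percentage effects.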

Example 2

Using the wage data set referred to on the course web page, we estimate the regression model

$$\log(\text{wage}) = \beta_0 + \delta_1\,\text{female} + \delta_f\,\text{mfemale} + \delta_m\,\text{mmale} + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{tenure} + u.$$

Questions:

1 Are there wage differences between genders?
2 Are there marriage premiums?
3 Does an additional year of education pay off equally well for women and men?