Applied Multivariate Analysis

Department of Mathematics and Statistics, University of Vaasa, Finland Spring 2017

Choosing a Statistical Method

1 Choosing an appropriate method
2 Cross-tabulation
  More advanced analysis of frequency tables (log-linear models)
3 Regression

Scales of Measurement

Measurement is a process by which numbers or symbols are attached to given characteristics of an object according to predetermined rules.

Main scales:
  Nominal: classification (similar, different)
  Ordinal: in addition to the nominal scale, ordering
  Interval: in addition to the previous, differences between two measurements are meaningful, but there is no fixed origin (zero point)
  Ratio: in addition to the previous, a fixed origin (zero point)

Dependent-Independent Variables: Statistical Methods

Depending on the scale, multivariate analysis can be applied (roughly) according to the following table in a dependent-independent variable analysis:

                          One dependent variable                                  More than one dependent variable
Independent variable(s)   Metric                    Nonmetric                     Metric                    Nonmetric
-------------------------------------------------------------------------------------------------------------------------------
One, metric               Regression analysis (RA)  Discriminant analysis (DA),   Canonical correlation     Multiple DA (MDA)
                                                    logistic regression
One, nonmetric            t-test                    Discrete DA (DDA)             MANOVA                    Multiple DDA (MDDA)
More than one, metric     Multiple RA               DA                            Canonical correlation,    MDA
                                                                                  structural equations
More than one, nonmetric  ANOVA                     DDA, conjoint analysis        MANOVA                    MDDA

Analysis of Interdependencies

The most common methods for analyzing interdependencies (without causal relationships) are:

No. of variables   Metric data                    Nonmetric data
-----------------------------------------------------------------------------
Two                Simple correlation             Two-way contingency tables
                   Cluster analysis               Loglinear models
More than two      Principal component analysis   Multiway contingency tables
                   Factor analysis                Loglinear models
                   Cluster analysis               Correspondence analysis

Analysis of frequency tables

Dependencies between two classification variables can be analyzed using cross-tabulation.

                      X
Y      1      2      ...    c      Sum
1      f_11   f_12   ...    f_1c   f_1.
2      f_21   f_22   ...    f_2c   f_2.
...    ...    ...    ...    ...    ...
r      f_r1   f_r2   ...    f_rc   f_r.
Sum    f_.1   f_.2   ...    f_.c   n

Here $f_{ij}$ is the number of observations in class $i$ of $Y$ and class $j$ of $X$, $f_{\cdot j} = \sum_{i=1}^{r} f_{ij}$ and $f_{i\cdot} = \sum_{j=1}^{c} f_{ij}$ are the marginal totals, and $n = \sum_{i=1}^{r} \sum_{j=1}^{c} f_{ij}$ is the total number of observations.
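For reference, dependence in such a table is usually tested with the Pearson chi-square statistic. The block below states the standard formula in the notation of the table; the symbol $e_{ij}$ for the expected count under independence is introduced here for illustration and does not appear in the original slides.

```latex
e_{ij} = \frac{f_{i\cdot}\, f_{\cdot j}}{n}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1).
```

Large values of $\chi^2$ relative to the chi-square distribution with $(r-1)(c-1)$ degrees of freedom speak against independence.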

Example 1 (Source: Base SAS Procedures Guide 9.2, Example 3.1)

The eye and hair color of children from two different regions of Europe are recorded in the data set Color. Instead of recording one observation per child, the data are recorded as cell counts, where the variable Count contains the number of children exhibiting each of the 15 eye and hair color combinations. The data set does not include missing combinations.

    data color;
       input region eyes $ hair $ count @@;   /* @@ allows reading several obs per line */
       label eyes   = 'Eye Color'
             hair   = 'Hair Color'
             region = 'Geographic Region';
       datalines;
    1 blue  fair   23  1 blue  red     7  1 blue  medium 24
    1 blue  dark   11  1 green fair   19  1 green red     7
    1 green medium 18  1 green dark   14  1 brown fair   34
    1 brown red     5  1 brown medium 41  1 brown dark   40
    1 brown black   3  2 blue  fair   46  2 blue  red    21
    2 blue  medium 44  2 blue  dark   40  2 blue  black   6
    2 green fair   50  2 green red    31  2 green medium 37
    2 green dark   23  2 brown fair   56  2 brown red    42
    2 brown medium 53  2 brown dark   54  2 brown black  13
    ;

SAS commands

The data can be presented in contingency tables. Two-way tables:

    /* dependency of region and eye color */
    proc freq data = color;
       title "Region and Eye Color of European Children";
       tables region*eyes / chisq norow nocol nopercent;
       weight count;   /* Note: important to weight by count due to the format of the data */
    run;

SAS results

OUTPUT:

Region and Eye Color of European Children
=======================================================
                          Eye Color
Geographic Region     blue     brown     green     Total
-------------------------------------------------------
1                       65       123        58       246
                     26.42     50.00     23.58       100
2                      157       218       141       516
                     30.43     42.25     27.33       100
-------------------------------------------------------
Total                  222       341       199       762
=======================================================

Statistics for Table of region by eyes
==================================================
Statistic                      DF      Value     Prob
--------------------------------------------------
Chi-Square                      2     4.0496   0.1320   <- Chi-square for independence (not significant)
Likelihood Ratio Chi-Square     2     4.0397   0.1327
Mantel-Haenszel Chi-Square      1     0.0020   0.9646
--------------------------------------------------
Phi Coefficient                       0.0729
Contingency Coefficient               0.0727
Cramer's V                            0.0729
==================================================
Sample Size = 762
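The chi-square line of this output can be cross-checked by hand. The following is a minimal sketch in plain Python, not part of the course material: the function name `pearson_chi2` is hypothetical, and the p-value shortcut exp(-x/2) is valid only because this table has exactly 2 degrees of freedom.

```python
import math

# Observed counts from the SAS output above:
# rows = region 1, region 2; columns = blue, brown, green.
observed = [
    [65, 123, 58],
    [157, 218, 141],
]

def pearson_chi2(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (f - e) ** 2 / e
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return chi2, df

chi2, df = pearson_chi2(observed)

# Assumption: df == 2, for which the chi-square upper tail is exactly exp(-x/2).
p_value = math.exp(-chi2 / 2)

# Cramer's V = sqrt(chi2 / (n * min(r - 1, c - 1))).
n = sum(map(sum, observed))
k = min(len(observed), len(observed[0])) - 1
cramers_v = math.sqrt(chi2 / (n * k))

print(f"chi2 = {chi2:.4f}, df = {df}, p = {p_value:.4f}, V = {cramers_v:.4f}")
```

The printed values should agree with the Chi-Square and Cramer's V lines of the SAS output up to rounding.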

Example: Frequency tables

The chi-square value of 4.0496 with 2 degrees of freedom corresponds to a p-value of 0.1320, which indicates that there is no convincing empirical evidence of dependence between region and eye color.

A similar analysis for hair color (see the SAS example on the web page) shows that hair color is related to region, $\chi^2(4) = 20.5$, with p-value 0.0004. The percentage distributions of the SAS output show that medium hair color is more frequent in region 1, while red is more frequent in region 2.

A similar analysis of hair color and eye color suggests dependence between them, $\chi^2(8) = 20.9$, $p = 0.0073$. The major source of the dependence seems to be that brown eyes are more closely related to dark hair than the other eye colors are.

Finally, we can analyze the relation of eye and hair color separately in the different regions by running three-way tables.

    /* 3-way contingency table */
    /* the first variable becomes a kind of control variable */
    proc freq data = color;
       title "Three-way table of Region, Eye color, and Hair color";
       tables region*hair*eyes / chisq norow nocol nopercent;   /* Note: tables are formed by region */
       weight count;
    run;

The results show that the dependence is present mainly in region 2.
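The same stratified analysis can be sketched outside SAS. The plain-Python illustration below (an assumption-laden sketch, not the course's code) accumulates the cell counts of Example 1 into eye-by-hair tables, pooled and per region, and computes the Pearson chi-square for each. All function names are hypothetical, and the closed-form p-value used here works only for an even number of degrees of freedom, which happens to hold for these tables.

```python
import math

EYES = ["blue", "green", "brown"]
HAIR = ["fair", "red", "medium", "dark", "black"]

# Cell counts from Example 1: (region, eyes, hair, count).
cells = [
    (1, "blue", "fair", 23), (1, "blue", "red", 7), (1, "blue", "medium", 24),
    (1, "blue", "dark", 11), (1, "green", "fair", 19), (1, "green", "red", 7),
    (1, "green", "medium", 18), (1, "green", "dark", 14), (1, "brown", "fair", 34),
    (1, "brown", "red", 5), (1, "brown", "medium", 41), (1, "brown", "dark", 40),
    (1, "brown", "black", 3), (2, "blue", "fair", 46), (2, "blue", "red", 21),
    (2, "blue", "medium", 44), (2, "blue", "dark", 40), (2, "blue", "black", 6),
    (2, "green", "fair", 50), (2, "green", "red", 31), (2, "green", "medium", 37),
    (2, "green", "dark", 23), (2, "brown", "fair", 56), (2, "brown", "red", 42),
    (2, "brown", "medium", 53), (2, "brown", "dark", 54), (2, "brown", "black", 13),
]

def eyes_by_hair(records):
    """Sum the counts into an eyes x hair table (the role of WEIGHT count in SAS)."""
    table = [[0] * len(HAIR) for _ in EYES]
    for _region, eyes, hair, count in records:
        table[EYES.index(eyes)][HAIR.index(hair)] += count
    return table

def pearson_chi2(table):
    """Return (chi-square, df); assumes every row and column total is positive."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = sum(
        (f - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )
    return chi2, (len(row_tot) - 1) * (len(col_tot) - 1)

def chi2_sf_even_df(x, df):
    """Upper-tail probability of the chi-square distribution, even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half, term, total = x / 2, 1.0, 0.0
    for k in range(df // 2):
        total += term            # adds (x/2)^k / k!
        term *= half / (k + 1)
    return math.exp(-half) * total

results = {}
strata = {
    "pooled": cells,
    "region 1": [c for c in cells if c[0] == 1],
    "region 2": [c for c in cells if c[0] == 2],
}
for label, subset in strata.items():
    chi2, df = pearson_chi2(eyes_by_hair(subset))
    results[label] = (chi2, df, chi2_sf_even_df(chi2, df))
    print(f"{label}: chi2 = {chi2:.2f}, df = {df}, p = {results[label][2]:.4f}")
```

Several expected counts in the "black" hair column are small, so the chi-square approximation for the per-region tables should be read with some caution.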

More advanced analysis of frequency tables (log-linear models)

Intuition of log-linear models

Consider two categorical variables $A$ and $B$, and let $f_{ij}$ be the number of observations (frequency) in category $i$ of $A$ and category $j$ of $B$. Let $\eta_{ij} = E[f_{ij}]$ denote the expected value of $f_{ij}$, i.e., the expected number of observations, out of $n$, falling into class $i$ of $A$ and class $j$ of $B$. Denote further by $f_{i\cdot} = \sum_{j=1}^{b} f_{ij}$ and $f_{\cdot j} = \sum_{i=1}^{a} f_{ij}$ the marginal totals (frequencies), where $a$ is the number of categories of $A$ and $b$ the number of categories of $B$, with expected values $\eta_{i\cdot}$ and $\eta_{\cdot j}$.

Log-linear models

Writing
$$\eta_{ij} = \eta_{i\cdot}\,\eta_{\cdot j}\,\frac{\eta_{ij}}{\eta_{i\cdot}\,\eta_{\cdot j}}$$
and taking logarithms, we get a model of the form
$$\log(\eta_{ij}) = \lambda + \lambda_i + \lambda_j + \lambda_{ij}, \qquad (1)$$
where $\lambda$ is related to the average frequency, $\lambda_i$ indicates the marginal contribution of $A$ and $\lambda_j$ the marginal contribution of $B$ to the expected frequency, and $\lambda_{ij}$ indicates the joint effect of $A$ and $B$. In this simple case, if $\lambda_{ij} = 0$ for all $i$ and $j$, then the variables $A$ and $B$ are independent.
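To see how the terms of model (1) behave, the sketch below decomposes log expected counts into $\lambda$, $\lambda_i$, $\lambda_j$, and $\lambda_{ij}$ for the region by eye color table of Example 1. It assumes sum-to-zero (effect-coding) constraints to make the terms unique; the slides do not state which constraints are used, so this choice, like the function names, is an assumption made for illustration.

```python
import math

# Observed region x eye-color counts (columns: blue, brown, green).
observed = [[65, 123, 58], [157, 218, 141]]

def independence_fit(table):
    """Expected counts under independence: row total * column total / n."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    return [[ri * cj / n for cj in col_tot] for ri in row_tot]

def loglinear_terms(eta):
    """Split log(eta_ij) into lambda, lambda_i, lambda_j, lambda_ij.

    Assumes sum-to-zero constraints on every set of terms.
    """
    a, b = len(eta), len(eta[0])
    log_eta = [[math.log(v) for v in row] for row in eta]
    lam = sum(map(sum, log_eta)) / (a * b)                     # grand mean
    lam_a = [sum(row) / b - lam for row in log_eta]            # row effects
    lam_b = [sum(log_eta[i][j] for i in range(a)) / a - lam    # column effects
             for j in range(b)]
    lam_ab = [[log_eta[i][j] - lam - lam_a[i] - lam_b[j] for j in range(b)]
              for i in range(a)]
    return lam, lam_a, lam_b, lam_ab

# Under the independence fit every interaction term vanishes.
_, _, _, inter_indep = loglinear_terms(independence_fit(observed))

# Treating the observed counts as the saturated fit gives non-zero interactions.
lam, lam_a, lam_b, inter_sat = loglinear_terms(observed)

print("max |lambda_ij| under independence:",
      max(abs(v) for row in inter_indep for v in row))
print("lambda_ij from observed counts:",
      [[round(v, 4) for v in row] for row in inter_sat])
```

Under the independence fit the interaction terms are zero up to floating-point error, which is the statement "$\lambda_{ij} = 0$ implies independence" read in the other direction.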

The above generalizes to higher-order tables. Consider three variables $A$, $B$, and $C$, with $f_{ijk}$ denoting the number of observations in cell $(i, j, k)$, i.e., in the intersection of class $i$ of $A$, class $j$ of $B$, and class $k$ of $C$. We are interested in the following models:

Model            Dependence structure
---------------------------------------------------------------
A + B + C        Independence model (only marginal effects)
AB + C           A and B dependent, C independent
AC + B           A and C dependent, B independent
A + BC           A independent, B and C dependent
AB + AC          A and B dependent, A and C dependent
AB + BC          A and B dependent, B and C dependent
AC + BC          A and C dependent, B and C dependent
AB + AC + BC     All pairwise dependencies
ABC              Saturated model
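One way to compare the models in this list is by their residual degrees of freedom, i.e., the number of cells minus the number of free parameters. The sketch below counts parameters for hierarchical models, where a term such as AB implies the main effects A and B. The table dimensions 2 x 3 x 5 are chosen to mirror region, eye color, and hair color, and the function names are hypothetical. The counts assume no empty cells or margins, so software output for sparse data (such as the color data) may report different values.

```python
from itertools import combinations
from math import prod

def hierarchical_effects(generators):
    """All effects implied by the generating terms, e.g. "AB" implies A, B, AB."""
    effects = set()
    for term in generators:
        for size in range(1, len(term) + 1):
            effects.update(combinations(sorted(term), size))
    return effects

def residual_df(levels, generators):
    """Number of cells minus number of free parameters (intercept included)."""
    n_cells = prod(levels.values())
    n_params = 1 + sum(
        prod(levels[v] - 1 for v in effect)
        for effect in hierarchical_effects(generators)
    )
    return n_cells - n_params

# Assumed dimensions: A has 2 levels, B has 3, C has 5.
levels = {"A": 2, "B": 3, "C": 5}

models = {
    "A + B + C": ["A", "B", "C"],
    "AB + C": ["AB", "C"],
    "AC + B": ["AC", "B"],
    "A + BC": ["A", "BC"],
    "AB + AC": ["AB", "AC"],
    "AB + BC": ["AB", "BC"],
    "AC + BC": ["AC", "BC"],
    "AB + AC + BC": ["AB", "AC", "BC"],
    "ABC": ["ABC"],
}

dfs = {name: residual_df(levels, gens) for name, gens in models.items()}
for name, df in dfs.items():
    print(f"{name:<14} residual df = {df}")
```

The saturated model leaves no residual degrees of freedom, which is why it reproduces the observed table exactly and cannot be tested against the data.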

Example: SAS CATMOD for log-linear models

See the SAS example on the course web page for an analysis with empirical data of the dependencies in the above table. As homework, also work through the example on the course web page.

Regression

The basic regression model is of the form
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + u_i, \qquad (2)$$
where $y_i$ is the dependent variable, $x_{ij}$ are the explanatory variables, and $u_i$ is the error term, assumed independently and identically distributed (iid), $i = 1, \ldots, n$ (sample size), $j = 1, \ldots, p$.

The slope coefficient $\beta_j$ indicates the marginal effect of variable $x_j$ on the dependent variable, given that the other $x$-variables do not change (ceteris paribus condition); i.e., if $x_j$ changes by one unit, $y$ is expected to change by $\beta_j$ units. Note that when transformations are applied to the variables, the interpretation of the slope coefficients must be adapted accordingly.
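To make the ceteris paribus reading of model (2) concrete, the sketch below estimates the coefficients by ordinary least squares through the normal equations, using only plain Python. The data are synthetic and noise-free (made up for illustration, generated from $y = 1 + 2x_1 - 0.5x_2$), and the helper names `solve_linear` and `ols` are hypothetical; for real data one would normally use a statistics package.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so that the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(xs, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in xs]  # prepend the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)

# Synthetic data: y = 1 + 2*x1 - 0.5*x2 exactly (no error term).
xs = [(0, 1), (1, 0), (2, 2), (3, 1), (4, 3), (5, 0)]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in xs]

beta = ols(xs, y)
print("estimated coefficients:", [round(b, 6) for b in beta])
```

Because the data contain no noise, the estimates recover the generating coefficients: raising $x_1$ by one unit while holding $x_2$ fixed changes $y$ by the first slope.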

Example 2

Using the wage data set referred to on the course web page, we estimate the regression model
$$\log(\text{wage}) = \beta_0 + \delta_1\,\text{female} + \delta_f\,\text{mfemale} + \delta_m\,\text{mmale} + \beta_1\,\text{educ} + \beta_2\,\text{exper} + \beta_3\,\text{tenure} + u.$$

Questions:
1 Are there wage differences between genders?
2 Is there a marriage premium?
3 Does an additional year of education pay off equally well for women and men?
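Because the dependent variable is log(wage), the dummy coefficients are differences in log wages and need to be converted before being read as percentages. The sketch below illustrates this under loudly stated assumptions: female = 1 for women, mfemale = 1 for married women, mmale = 1 for married men, so that single men are the reference group. The coefficient values are invented for illustration and are not estimates from the course data.

```python
import math

# Hypothetical coefficient values, made up for illustration only.
delta_1 = -0.11   # coefficient of female
delta_f = -0.09   # coefficient of mfemale (assumed: married women)
delta_m = 0.21    # coefficient of mmale (assumed: married men)

def log_gap_vs_single_men(female, married):
    """Expected difference in log(wage) relative to a single man,
    holding educ, exper, and tenure fixed."""
    mfemale = female * married
    mmale = (1 - female) * married
    return delta_1 * female + delta_f * mfemale + delta_m * mmale

def exact_percent(log_gap):
    """Convert a difference in logs into an exact percentage difference."""
    return 100 * (math.exp(log_gap) - 1)

groups = [
    ("single men", 0, 0),
    ("married men", 0, 1),
    ("single women", 1, 0),
    ("married women", 1, 1),
]
for label, female, married in groups:
    gap = log_gap_vs_single_men(female, married)
    print(f"{label:<14} log gap = {gap:+.2f}, exact = {exact_percent(gap):+.1f}%")
```

Under this coding the marriage premium is delta_m for men and delta_f for women, and the gender gap among single people is delta_1. Question 3 would additionally require an interaction between female and educ, which the model above does not contain.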