Missing Data Analysis for the Employee Dataset
67% of the observations have missing values!
Modeling Setup. For our analysis goals we would like to fit Y | X ~ N(Xβ, σ²I) and then interpret the coefficients to see how happiness relates to job performance. But we can't use this model directly because sometimes we are NOT given the x's. Solutions:
1. Throw out the incomplete observations.
2. Fill in the missing X's and treat them as truth.
3. Iteratively fill in the missing X's as we learn about the relationship between Y and X.
Solution #1, Listwise Deletion: Use only the complete cases.
Advantages:
1. Convenient.
Disadvantages:
1. Biases results if there is a reason the data were missing in the first place (e.g. poor performers don't report happiness).
2. Throws away much of the data.
Solution #1, Listwise Deletion: Wastes a lot of data.
N = number of observations
P = number of covariates
π = probability that the p-th covariate is missing
Assume each missing-covariate indicator is iid Bernoulli(π). Then:
Case i is complete ~ Bernoulli((1 − π)^P)
# of complete cases ~ Binomial(N, (1 − π)^P)
E(# complete cases) = N(1 − π)^P
Solution #1. [Figure: with π = 0.02 and N = 100, the expected number of complete cases E(# of CC) and the expected number thrown out, plotted against the number of covariates P from 0 to 50.]
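The expected-complete-cases formula above is easy to sketch numerically; this is a minimal numpy-free illustration (the course uses R, so treat this Python snippet as an assumption-free restatement of N(1 − π)^P rather than course code):

```python
def expected_complete_cases(N, pi, P):
    """Expected number of complete cases when each of the P covariates
    is missing independently with probability pi: N * (1 - pi)**P."""
    return N * (1 - pi) ** P

# The slide's setting: pi = 0.02, N = 100.
print(expected_complete_cases(100, 0.02, 10))  # ~81.7 cases survive
print(expected_complete_cases(100, 0.02, 50))  # ~36.4 cases survive
```

Even a tiny per-covariate missingness rate destroys most of the data once P grows, which is the point of the figure.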
Solution #2: Fill in the missing values. How do you want to fill in the missing values? Definition: to fill in = impute.
Mean Imputation: Replace missing values with the mean (or mode) of that particular variable.
Advantages:
1. Convenient.
Disadvantages:
1. Reduces the variability of the data.
2. Reduces correlations.
Solution #2: Fill in the missing values. Mean Imputation example: (X, Y) with Y always observed and X missing if X < 1/2. [Figure: scatterplots of the complete dataset and the imputed dataset; the imputed values all sit at the observed mean of X.]
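The shrinkage the slide warns about is easy to see in simulation. Below is a Python/numpy sketch of the slide's setup (the simulated slope 0.9 and noise level are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# The slide's setup: Y always observed, X missing whenever X < 1/2.
x = rng.normal(size=200)
y = 0.9 * x + rng.normal(scale=0.5, size=200)
x_obs = np.where(x < 0.5, np.nan, x)

# Mean imputation: replace each missing X with the mean of the observed X's.
x_imp = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# Most of the imputed column is now a single constant, so the
# variance (and typically the X-Y correlation) shrinks:
print(np.var(x), np.var(x_imp))
print(np.corrcoef(x, y)[0, 1], np.corrcoef(x_imp, y)[0, 1])
```

Because the missingness depends on X itself, the imputed "mean" is also badly biased here, compounding the variance problem.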
Solution #2: Fill in the missing values. Regression Imputation: Use the complete cases to fit a regression, then replace missing values with their predicted values.
Advantages:
1. Convenient.
2. Uses observed data to fill in missing data.
Disadvantages:
1. Increases correlations.
2. Biases variance estimates.
Solution #2: Fill in the missing values. Regression Imputation example: (X, Y) with Y always observed and X missing if X < 1/2. [Figure: scatterplots of the complete dataset and the imputed dataset; the imputed values all lie exactly on the fitted regression line.]
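A Python/numpy sketch of regression imputation under the same simulated setup (slope and noise level are illustrative assumptions). Since X is the variable being imputed, we regress X on Y using the complete cases:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same setup as the slide: Y observed everywhere, X missing when X < 1/2.
x = rng.normal(size=300)
y = 0.9 * x + rng.normal(scale=0.5, size=300)
miss = x < 0.5

# Fit X ~ Y on the complete cases; polyfit returns (slope, intercept).
b1, b0 = np.polyfit(y[~miss], x[~miss], 1)

# Replace each missing X with its predicted value on the fitted line.
x_imp = np.where(miss, b0 + b1 * y, x)
print(np.corrcoef(x_imp, y)[0, 1])
```

Every imputed point sits exactly on the line, which is why this method inflates the apparent X-Y correlation and understates residual variance.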
Solution #2: Fill in the missing values. Stochastic Regression Imputation: Use the complete cases to fit a regression, then replace missing values with a draw from the prediction distribution.
Advantages:
1. Convenient.
2. Uses observed data to fill in missing data.
3. Can produce unbiased estimates.
Disadvantages:
1. Decreases standard errors (the imputed values are treated as if they were observed).
Solution #2: Fill in the missing values. Stochastic Regression Imputation example. [Figure: scatterplots of the complete dataset and the imputed dataset; the imputed values scatter around the fitted regression line rather than lying on it.]
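Stochastic regression imputation adds noise back in. A Python/numpy sketch under the same illustrative simulation (the residual standard deviation is estimated from the complete cases):

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=300)
y = 0.9 * x + rng.normal(scale=0.5, size=300)
miss = x < 0.5

# Fit X ~ Y on complete cases and estimate the residual spread.
b1, b0 = np.polyfit(y[~miss], x[~miss], 1)
resid = x[~miss] - (b0 + b1 * y[~miss])
s = resid.std(ddof=2)  # ddof=2: two fitted parameters

# Draw each missing X from the prediction distribution N(b0 + b1*y, s^2)
# instead of plugging in the conditional mean.
x_imp = x.copy()
x_imp[miss] = b0 + b1 * y[miss] + rng.normal(scale=s, size=miss.sum())
print(x_imp[:5])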
Rethinking the Analysis. What do we need to accomplish the goals of this analysis? We need a method that can simultaneously do the following:
1. Give us the effect of happiness on job performance.
2. Help us fill in missing values so we don't have to throw away any data.
Rather than using our regression hammer on something that is not a nail, let's learn how to use a new tool: the multivariate normal distribution.
Review of the MVN Distribution. Let Y = (y_1, …, y_P)'. If Y follows a multivariate normal (Gaussian) distribution, then Y ~ N_P(µ, Σ) with density
f_Y(y) = (2π)^(−P/2) |Σ|^(−1/2) exp( −(1/2)(y − µ)' Σ^(−1) (y − µ) )
where µ = (µ_1, …, µ_P)' is the mean vector and Σ is the covariance matrix.
Awesome property: ANY marginal or conditional distribution is normal!
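As a sanity check on the density formula, here is a direct Python/numpy transcription (the helper name `mvn_pdf` is mine, not from the slides); with P = 1, µ = 0, Σ = 1 it must reproduce the standard normal density:

```python
import numpy as np

def mvn_pdf(y, mu, Sigma):
    """Evaluate the N_P(mu, Sigma) density at y, term by term
    as in the formula: (2*pi)^(-P/2) |Sigma|^(-1/2) exp(-quad/2)."""
    P = len(mu)
    dev = y - mu
    quad = dev @ np.linalg.solve(Sigma, dev)  # (y-mu)' Sigma^{-1} (y-mu)
    const = (2 * np.pi) ** (-P / 2) * np.linalg.det(Sigma) ** (-0.5)
    return const * np.exp(-0.5 * quad)

# Standard normal at 0: 1/sqrt(2*pi) ~ 0.3989
print(mvn_pdf(np.array([0.0]), np.array([0.0]), np.array([[1.0]])))
```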
Regression using the MVN Distribution. Assume Y_i ~ iid N(µ, Σ) and partition
Y_i = (y_i, x_i')', µ = (µ_y, µ_x')', Σ = [ [σ²_y, Σ_yx], [Σ_xy, Σ_x] ].
The conditional distribution of y_i | x_i is y_i | x_i ~ N(µ_{y|x}, σ²_{y|x}) where
µ_{y|x} = µ_y + Σ_yx Σ_x^(−1) (x_i − µ_x)
σ²_{y|x} = σ²_y − Σ_yx Σ_x^(−1) Σ_xy
R² = Σ_yx Σ_x^(−1) Σ_xy / σ²_y
Regression using the MVN Distribution. The conditional distribution of y | x is y | x ~ N(µ_{y|x}, σ²_{y|x}) where
µ_{y|x} = µ_y + Σ_yx Σ_x^(−1) (x − µ_x)
σ²_{y|x} = σ²_y − Σ_yx Σ_x^(−1) Σ_xy
So what? Let β = (β_1, …, β_P) = Σ_yx Σ_x^(−1), β_0 = µ_y − Σ_yx Σ_x^(−1) µ_x, and σ² = σ²_y − Σ_yx Σ_x^(−1) Σ_xy. Then
y | x ~ N(β_0 + x'β, σ²),
which is the linear regression we wanted in the first place!
Regression using the MVN Distribution. More general result (pretend Y_1 is your missing data): partition
Y = (Y_1', Y_2')', µ = (µ_1', µ_2')', Σ = [ [Σ_1, Σ_12], [Σ_21, Σ_2] ].
The conditional distribution of Y_1 | Y_2 is Y_1 | Y_2 ~ N(µ_{1|2}, Σ_{1|2}) where
µ_{1|2} = µ_1 + Σ_12 Σ_2^(−1) (Y_2 − µ_2)
Σ_{1|2} = Σ_1 − Σ_12 Σ_2^(−1) Σ_21
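The two conditioning formulas are mechanical given µ and Σ. Here is a Python/numpy sketch (the function name `conditional_params` and the index-list interface are my choices, not from the slides), checked on the bivariate correlation-0.9 example used later:

```python
import numpy as np

def conditional_params(mu, Sigma, idx1, idx2, y2):
    """Mean and covariance of Y1 | Y2 = y2 when Y ~ N(mu, Sigma).
    idx1/idx2 are the index lists of the Y1 and Y2 blocks."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S1 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S2 = Sigma[np.ix_(idx2, idx2)]
    W = S12 @ np.linalg.inv(S2)          # Sigma_12 Sigma_2^{-1}
    return mu1 + W @ (y2 - mu2), S1 - W @ S12.T

# Bivariate example with correlation 0.9, conditioning on Y2 = 1:
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
m, V = conditional_params(mu, Sigma, [0], [1], np.array([1.0]))
print(m, V)  # mean 0.9, variance 1 - 0.81 = 0.19
```

Note `S12.T` equals Σ_21 because Σ is symmetric, so the last line is exactly Σ_1 − Σ_12 Σ_2^(−1) Σ_21.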
Estimation for the MVN Distribution. We can prove (but won't, since plenty of textbooks contain the proofs) that if Y_i ~ iid N(µ, Σ), then the unbiased estimates are
µ̂ = (1/n) Σ_{i=1}^n Y_i                          (in R: apply(dataset, 2, mean))
Σ̂ = (1/(n−1)) Σ_{i=1}^n (Y_i − µ̂)(Y_i − µ̂)'      (in R: cov(dataset))
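For readers working outside R, the same two estimators in Python/numpy (the simulated 500 × 3 data matrix is just a stand-in for `dataset`):

```python
import numpy as np

rng = np.random.default_rng(3)
dataset = rng.normal(size=(500, 3))  # n x P data matrix

# numpy analogues of apply(dataset, 2, mean) and cov(dataset):
mu_hat = dataset.mean(axis=0)                 # column means
Sigma_hat = np.cov(dataset, rowvar=False)     # divides by n - 1, like R's cov()
print(mu_hat.shape, Sigma_hat.shape)
```

`rowvar=False` tells numpy that observations are rows and variables are columns, matching R's convention.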
MVN Regression Model. Multivariate normal model for regression: (y_i, x_i')' ~ iid N(µ, Σ), so that µ and Σ are the unknown parameters. Why is this useful for the employee dataset?
1. We can still get the regression coefficients we are interested in by looking at the distribution of y | x (see previous slides for formulas).
2. We can fill in missing values by drawing from the distribution of missing | observed.
Solution 3: Multiple Imputation. The three steps of multiple imputation: Imputation, Estimation, Pooling.
Incomplete Data → (Imputation) → Data Set 1, …, Data Set M → (Estimation) → parameter estimates from each dataset → (Pooling) → Final Results
Solution 3: Multiple Imputation. The multiple imputation algorithm for the MVN regression model:
1. Choose an initial value for µ and Σ (just use the complete data initially).
2. For m = 1, …, M:
   i. Create a new complete dataset by filling in any missing values with draws from the conditional distribution (see previous formulas).
   ii. Re-estimate the parameters µ and Σ.
Computation hint: at each iteration of the imputation phase, keep only the parameters you are interested in rather than the whole dataset (this saves memory).
Solution 3: Multiple Imputation. The multiple imputation algorithm, a MVN example:
Y ~ N_2( (0, 0)', [ [1, 0.9], [0.9, 1] ] )
Y = (Y_1, Y_2): Y_1 always observed, Y_2 missing if Y_1 < 1. [Figure: scatterplot of the simulated data showing which points have Y_2 missing.]
Solution 3: Multiple Imputation. The imputation step (algorithm), a MVN example:
1. Set µ̂_0 and Σ̂_0 to the complete-case empirical mean and covariance matrix.
2. For m = 1, …, M:
   i. For each observation with Y_2 missing, draw from the conditional distribution
      y_2 | y_1 ~ N( µ_2 + (σ_21/σ_1²)(y_1 − µ_1), σ_2² − σ_21²/σ_1² )
   ii. Set µ̂_m = (1/n) Σ_{i=1}^n y_i and Σ̂_m = (1/(n−1)) Σ_{i=1}^n (y_i − µ̂_m)(y_i − µ̂_m)'.
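The loop above can be sketched end to end on the slide's bivariate example. This is a minimal Python/numpy illustration of the impute/re-estimate cycle only (sample size, number of iterations, and seed are my assumptions; it keeps the whole dataset for clarity rather than following the memory-saving hint):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate the example: Y ~ N2(0, [[1, .9], [.9, 1]]),
# Y1 always observed, Y2 missing whenever Y1 < 1.
n = 500
L = np.linalg.cholesky(np.array([[1.0, 0.9], [0.9, 1.0]]))
Y = rng.normal(size=(n, 2)) @ L.T
miss = Y[:, 0] < 1.0

# Step 1: initialize with complete-case mean and covariance.
cc = Y[~miss]
mu, S = cc.mean(axis=0), np.cov(cc, rowvar=False)

Y_imp = Y.copy()
for m in range(20):
    # Step 2i: draw y2 | y1 ~ N(mu2 + (s21/s11)(y1 - mu1), s22 - s21^2/s11)
    cond_mean = mu[1] + S[0, 1] / S[0, 0] * (Y_imp[miss, 0] - mu[0])
    cond_sd = np.sqrt(S[1, 1] - S[0, 1] ** 2 / S[0, 0])
    Y_imp[miss, 1] = cond_mean + cond_sd * rng.normal(size=miss.sum())
    # Step 2ii: re-estimate mu and Sigma from the filled-in dataset.
    mu, S = Y_imp.mean(axis=0), np.cov(Y_imp, rowvar=False)

print(mu, S)
```

Starting from the badly biased complete-case estimates (Y_1 > 1 is a selected sample), the iterations pull µ̂ and Σ̂ back toward the truth.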
Solution 3: Multiple Imputation The Imputation Step (Algorithm): Issues to Consider 1. The sequence of parameters and missing data imputations should converge.
Solution 3: Multiple Imputation. The imputation step (algorithm), issues to consider: 2. How do we assess convergence? Subjective assessment of trace plots, or convergence diagnostics (the coda package in R), but you'll learn about these in 651 (Bayes).
Solution 3: Multiple Imputation. The pooling phase (pool across the results from the M imputed datasets). Pooling parameter estimates (θ is the parameter you are interested in):
θ̄ = (1/M) Σ_{m=1}^M θ̂_m
Note: this pooled estimate is most appropriate under normality. Use the median if the estimates are skewed.
Solution 3: Multiple Imputation. The pooling phase: pooling the M standard errors.
V_w = (1/M) Σ_{m=1}^M SE²(θ̂_m)          (within-imputation variance)
V_b = (1/(M−1)) Σ_{m=1}^M (θ̂_m − θ̄)²     (between-imputation variance)
V_T = V_w + V_b + V_b/M  ⇒  SE_pool = √V_T
Solution 3: Multiple Imputation. The pooling phase: fraction of missing information
FMI = (V_b + V_b/M) / V_T
Hypothesis testing and CIs:
t = (θ̄ − θ_0) / √V_T,   ν = (M − 1) FMI^(−2)   ⇒   CI: θ̄ ± t* √V_T
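The pooling formulas above (Rubin's rules) fit in a few lines. A Python/numpy sketch with made-up estimates and standard errors purely for illustration:

```python
import numpy as np

def pool(estimates, std_errors):
    """Pool M point estimates and standard errors using the
    formulas above: Vw, Vb, VT, FMI, and degrees of freedom."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    M = len(estimates)
    theta_bar = estimates.mean()
    Vw = np.mean(std_errors ** 2)        # within-imputation variance
    Vb = estimates.var(ddof=1)           # between-imputation variance
    Vt = Vw + Vb + Vb / M                # total variance
    fmi = (Vb + Vb / M) / Vt             # fraction of missing information
    df = (M - 1) / fmi ** 2              # t degrees of freedom
    return theta_bar, np.sqrt(Vt), fmi, df

# Hypothetical results from M = 5 imputed datasets:
est = [1.0, 1.2, 0.9, 1.1, 1.05]
se = [0.20, 0.22, 0.19, 0.21, 0.20]
theta, se_pool, fmi, df = pool(est, se)
print(theta, se_pool, fmi, df)
```

Note the pooled SE exceeds any single-dataset SE whenever the estimates disagree across imputations; that between-imputation spread is exactly the extra uncertainty single imputation ignores.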
Points about the MVN Approach to Regression. A few points:
1. The data have to follow a MVN distribution: univariate histograms should look normal, and bivariate relationships need to be linear.
2. You can't use it directly with categorical covariates (because those aren't normal). But the general idea carries over: define a joint distribution of (Y, X), then fill in missing data from the conditionals.
3. You can use this even if you don't have missing observations (it's a way to do regression), but if you are always given the x's the usual approach is easier because you can use lm().
4. It is a very useful tool if you don't know which variable is your response variable, or you have multiple response variables (see Stat 666).
Expectations for the Employee Analysis. Expectations:
1. Carry out a regression without discarding the incomplete observations. Justify the techniques you use (e.g. if you use mean imputation, tell me why you did it).
2. Justify any assumptions in your model.
3. Don't worry about variable selection (just use all the variables, since there aren't that many anyway).
4. Describe how well you are explaining job performance using the variables you have.
5. Include uncertainty.