Missing Data Analysis for the Employee Dataset
67% of the observations have missing values!
Modeling Setup. Random variables: $Y_i = (Y_{i1}, \ldots, Y_{ip})' = (Y_{i,obs}, Y_{i,miss})'$ and $R_i = (R_{i1}, \ldots, R_{ip})'$, where $R_{ij} = 1$ if $Y_{ij}$ is missing and $R_{ij} = 0$ otherwise.
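A minimal R sketch of this setup, assuming the data sit in a data frame (the names `Y` and `Rmat` are illustrative):

## Missingness indicators: Rmat[i, j] = 1 if Y[i, j] is missing, 0 otherwise
Y <- data.frame(y1 = c(1.2, NA, 0.3), y2 = c(NA, 2.1, 0.7))
Rmat <- 1 * is.na(Y)
rowSums(Rmat)             # number of missing entries per observation
mean(rowSums(Rmat) > 0)   # fraction of observations with any missing value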
Missing Data Patterns [pattern diagrams over variables $Y_1, \ldots, Y_4$ omitted]:
- Univariate: missingness confined to a single variable.
- Unit nonresponse: respondents refuse to answer a set of variables.
- Monotone (longitudinal): missingness due to dropout, so once a value is missing all subsequent ones are too.
- General: missing values spread throughout.
- Latent variables: all values of a single variable are missing.
Missing Data Mechanisms (Rubin 1976). Let $\theta$ denote the parameters of interest and $\psi$ the parameters governing the missing data.
1. Missing Completely at Random (MCAR): $[R, Y \mid \psi, \theta] = [R \mid \psi]\,[Y \mid \theta]$
2. Missing at Random (MAR): $[R, Y \mid \psi, \theta] = [R \mid Y_{obs}, \psi]\,[Y \mid \theta]$
3. Not Missing at Random (NMAR or MNAR): $[R, Y \mid \psi, \theta] = [R \mid Y_{obs}, Y_{miss}, \psi]\,[Y \mid \theta]$
Missing Data Mechanisms (Rubin 1976). 1. MCAR example: $Y = (Y_1, Y_2)$ with $Y_1$ always observed and $R_2 \sim \mathcal{B}(1, 0.1)$, so missingness in $Y_2$ is unrelated to the data.
Missing Data Mechanisms (Rubin 1976). 2. MAR example: $Y = (Y_1, Y_2)$ with $Y_1$ always observed and $R_2 = 1(Y_1 < 1)$, so missingness in $Y_2$ depends only on the observed $Y_1$.
Missing Data Mechanisms (Rubin 1976). 3. NMAR example: $Y = (Y_1, Y_2)$ with $Y_1$ always observed and $R_2 = 1(Y_2 < 1)$, so missingness in $Y_2$ depends on the unobserved value itself.
Missing Data Mechanisms (Rubin 1976). Why do we need to understand the missing data mechanism? If the data are NMAR, then the missing data indicators, marginally, contain information about the parameters we are interested in: even after integrating out the missing observations, $R$ remains dependent on the distribution of $Y$ [graphical diagram omitted]. Take-home message: if data are NMAR, we have to model the missing data indicators.
Missing Data Mechanisms (Rubin 1976). On the other hand, if data are MAR (or MCAR), then after integrating out the missing observations the missing data indicators don't relate to the parameters of interest [graphical diagram omitted]. Take-home message: if data are MAR, we don't have to model the missing data indicators, but we still need to include the incomplete observations (because of correlation).
Missing Data Mechanisms (Rubin 1976). How can we tell what missing data mechanism is present? There is no way to confirm NMAR from the data alone (the missing values are unobserved). We can, however, distinguish between MCAR and MAR:
- Fit a logistic regression of the missing data indicator on the observed data (if MCAR, nothing will be significant).
- Compare the distribution (via a Kolmogorov-Smirnov test or simple t-tests) of the observed data when R = 1 vs. R = 0.
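A sketch of both checks in R, under an assumed data frame `dat` with `y1` fully observed and `y2` subject to missingness (names and mechanism are illustrative):

## Simulate a MAR mechanism for illustration
set.seed(1)
dat <- data.frame(y1 = rnorm(200), y2 = rnorm(200))
dat$y2[dat$y1 < -1] <- NA
r2 <- as.numeric(is.na(dat$y2))

## 1. Logistic regression of the missingness indicator on observed data;
##    under MCAR, nothing should be significant
summary(glm(r2 ~ y1, data = dat, family = binomial))

## 2. Compare observed y1 when R = 1 vs. R = 0
t.test(dat$y1[r2 == 1], dat$y1[r2 == 0])
ks.test(dat$y1[r2 == 1], dat$y1[r2 == 0])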
Traditional Missing Data Methods. Listwise Deletion: use only the complete cases.
Advantages: 1. Convenient. 2. OK if data are MCAR.
Disadvantages: 1. Biased results otherwise. 2. Throws away much of the data.
Traditional Missing Data Methods. Listwise deletion wastes a lot of data. Let $N$ be the number of observations, $P$ the number of covariates, and $\pi$ the probability that the $p$th covariate is missing. Assuming $R_{ip} \stackrel{iid}{\sim} \mathcal{B}(1, \pi)$, case $i$ is complete with probability $(1 - \pi)^P$, the number of complete cases is $\mathcal{B}(N, (1 - \pi)^P)$, and $E(\#\text{ complete cases}) = N (1 - \pi)^P$.
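The formula is easy to evaluate directly in R (values are approximate):

## Expected number of complete cases: N * (1 - pi)^P
N <- 100; p.miss <- 0.02; P <- 0:50
expected.cc <- N * (1 - p.miss)^P
expected.cc[P %in% c(10, 25, 50)]   # roughly 81.7, 60.3, 36.4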
[Figure omitted: with $\pi = 0.02$ and $N = 100$, the expected number of complete cases decays, and the expected number of cases thrown out grows, as the number of covariates $P$ runs from 0 to 50.]
Traditional Missing Data Methods. Listwise Deletion: use only the complete data. Estimated means under each mechanism (true means are 0) [scatterplots omitted]:

Mechanism   $\hat\mu_1$   $\hat\mu_2$
MCAR        0.00          0.03
MAR         0.35          0.38
NMAR        0.33          0.41
Traditional Missing Data Methods. Mean Imputation: replace missing values with the mean (or mode) of that particular variable.
Advantages: 1. Convenient.
Disadvantages: 1. Reduces the variability of the data. 2. Attenuates correlations.
Mean Imputation example [figure omitted]: $Y = (Y_1, Y_2)$ with $Y_1$ always observed and $R_2 = 1(Y_1 < 1)$.
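A minimal sketch of mean imputation in R for a simulated version of this example (variable names illustrative); note how the variance and correlation shrink:

set.seed(2)
y1 <- rnorm(200)
y2 <- 0.9 * y1 + sqrt(1 - 0.9^2) * rnorm(200)
y2[y1 < 1] <- NA                        # R2 = 1(Y1 < 1)
y2.imp <- ifelse(is.na(y2), mean(y2, na.rm = TRUE), y2)
var(y2.imp); cor(y1, y2.imp)            # both attenuated relative to the truth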
Traditional Missing Data Methods. Regression Imputation: use the complete cases to fit a regression, then replace missing values with predicted values.
Advantages: 1. Convenient. 2. Uses observed data to fill in missing data.
Disadvantages: 1. Inflates correlations. 2. Biases variance estimates downward.
Regression Imputation example [figure omitted]: $Y = (Y_1, Y_2)$ with $Y_1$ always observed and $R_2 = 1(Y_1 < 1)$.
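A sketch of regression imputation for the same simulated example (assumptions as in the mean imputation sketch above):

set.seed(3)
y1 <- rnorm(200)
y2 <- 0.9 * y1 + sqrt(1 - 0.9^2) * rnorm(200)
y2[y1 < 1] <- NA
fit <- lm(y2 ~ y1)                      # lm() drops the incomplete cases
y2.imp <- y2
y2.imp[is.na(y2)] <- predict(fit, newdata = data.frame(y1 = y1[is.na(y2)]))
cor(y1, y2.imp)                         # imputed values sit exactly on the fitted
                                        # line, so the correlation is inflated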
Traditional Missing Data Methods. Stochastic Regression Imputation: use the complete cases to fit a regression, then replace missing values with a draw from the predictive distribution.
Advantages: 1. Convenient. 2. Uses observed data to fill in missing data. 3. Produces unbiased estimates of parameters if MAR.
Disadvantages: 1. Understates standard errors (a single imputation ignores imputation uncertainty).
Stochastic Regression Imputation example [figure omitted].
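A sketch of the stochastic version: the same regression fit, but each imputation adds a residual draw (assumptions as in the earlier sketches):

set.seed(4)
y1 <- rnorm(200)
y2 <- 0.9 * y1 + sqrt(1 - 0.9^2) * rnorm(200)
y2[y1 < 1] <- NA
fit  <- lm(y2 ~ y1)
miss <- is.na(y2)
pred <- predict(fit, newdata = data.frame(y1 = y1[miss]))
y2.imp <- y2
y2.imp[miss] <- pred + rnorm(sum(miss), sd = summary(fit)$sigma)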
Traditional Missing Data Methods. Hot Deck Imputation: find the K nearest neighbors, then replace missing values with the mean (or mode) of those neighbors.
Advantages: 1. Convenient. 2. Maintains univariate distributions.
Disadvantages: 1. Overestimates correlations (particularly when K = 1). 2. Slightly understates standard errors.
Hot Deck Imputation example [figure omitted].
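A minimal hot deck sketch in R, matching donors on the observed $Y_1$ (a deliberately simple nearest-neighbor search, not an optimized one):

set.seed(5)
y1 <- rnorm(200)
y2 <- 0.9 * y1 + sqrt(1 - 0.9^2) * rnorm(200)
y2[y1 < 1] <- NA
K <- 5
miss <- which(is.na(y2)); obs <- which(!is.na(y2))
y2.imp <- y2
for (i in miss) {
  nn <- obs[order(abs(y1[obs] - y1[i]))[1:K]]  # K nearest donors on y1
  y2.imp[i] <- mean(y2[nn])                    # or sample(y2[nn], 1) to draw a donor
}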
Modeling Missing Data. Key idea: we need a multivariate model for $Y_i = (Y_{i1}, \ldots, Y_{ip})' = (Y_{i,obs}, Y_{i,miss})'$ rather than just a univariate response. A common (and extremely useful) multivariate tool is the multivariate normal distribution (MVN).
Review of MVN Distribution. Let $Y = (Y_1, \ldots, Y_P)'$. If $Y$ follows a multivariate normal (Gaussian) distribution, then $Y \sim \mathcal{N}_P(\mu, \Sigma)$ with density
$$f_Y(y) = \frac{1}{(2\pi)^{P/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (y - \mu)' \Sigma^{-1} (y - \mu) \right\},$$
where $\mu = (\mu_1, \ldots, \mu_P)'$ is the mean vector and $\Sigma$ is the covariance matrix.
Review of MVN Distribution. Partition $Y = (Y_1', Y_2')'$, $\mu = (\mu_1', \mu_2')'$, and
$$\Sigma = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_2 \end{pmatrix}.$$
The marginal distribution of $Y_1$ is $Y_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$. The conditional distribution of $Y_1 \mid Y_2$ is $\mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$, where
$$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_2^{-1} (Y_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_1 - \Sigma_{12} \Sigma_2^{-1} \Sigma_{12}'.$$
Review of MVN Distribution. How to draw from $\mathcal{N}(\mu, \Sigma)$:
1. Calculate the Cholesky decomposition $\Sigma = LL'$.
2. Draw $Z \sim \mathcal{N}(0, I)$.
3. Set $Y = \mu + LZ$.
In R: mvn.draw <- mu + t(chol(Sigma)) %*% rnorm(P)
Can you show $E(Y) = \mu$ and $V(Y) = \Sigma$?
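A quick simulation check of both identities (a sketch; recall that R's chol() returns the upper triangular factor, so its transpose is $L$):

set.seed(6)
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
L     <- t(chol(Sigma))                 # Sigma = L L'
draws <- replicate(1e4, as.numeric(mu + L %*% rnorm(2)))
rowMeans(draws)   # approx mu:    E(Y) = mu + L E(Z) = mu
cov(t(draws))     # approx Sigma: V(Y) = L V(Z) L' = L L' = Sigma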
Regression with the MVN. Partition $Y_i = (y_i, X_i')'$, $\mu = (\mu_y, \mu_X')'$, and
$$\Sigma = \begin{pmatrix} \sigma_y^2 & \Sigma_{yX} \\ \Sigma_{yX}' & \Sigma_X \end{pmatrix}.$$
The conditional distribution of $y_i \mid X_i$ is $\mathcal{N}(\mu_{y|X}, \sigma^2_{y|X})$, where
$$\mu_{y|X} = \mu_y + \Sigma_{yX} \Sigma_X^{-1} (X_i - \mu_X) = \underbrace{\mu_y - \Sigma_{yX} \Sigma_X^{-1} \mu_X}_{\beta_0} + \underbrace{\Sigma_{yX} \Sigma_X^{-1}}_{\beta_1'} X_i = \tilde{X}_i' \beta,$$
with $\tilde{X}_i$ including an intercept.
Assessing MVN. How do we know if data arise from a multivariate normal distribution?
1. Univariate histograms (or density estimates).
2. Bivariate density estimates.
3. Chi-square QQ plot: under MVN, $(Y_i - \mu)' \Sigma^{-1} (Y_i - \mu) \sim \chi^2_p$.
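A sketch of the chi-square QQ plot, assuming the MASS package is available for simulating MVN data:

set.seed(7)
Y  <- MASS::mvrnorm(500, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2))
d2 <- mahalanobis(Y, center = colMeans(Y), cov = cov(Y))
qqplot(qchisq(ppoints(length(d2)), df = ncol(Y)), sort(d2),
       xlab = "Chi-square quantiles", ylab = "Ordered Mahalanobis distances")
abline(0, 1)   # points near this line are consistent with MVN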
Regression with the MVN. Key points:
1. If $Y_i = (y_i, X_i')'$ is MVN, the regression coefficients of $y_i$ on $X_i$ come directly from the mean vector and covariance matrix.
2. It is easy to get any conditional distribution (including the distribution of $x$ given $y$) via properties of the MVN.
But what are the MLEs of $\mu$ and $\Sigma$? With complete data,
$$\hat\mu = \frac{1}{N} \sum_i Y_i, \qquad \hat\Sigma = \frac{1}{N} \sum_i (Y_i - \hat\mu)(Y_i - \hat\mu)'.$$
Maximum Likelihood Estimation with Missing Data. The missing data likelihood, where $f_Y(y \mid \theta)$ is the joint distribution of ALL the data:
$$L(\theta) = \prod_{i=1}^n \underbrace{\int_{\mathcal{Y}_{i,miss}} f_Y(y_{i,obs}, y_{i,miss} \mid \theta)\, dy_{i,miss}}_{\text{marginal dist. of observed data}},$$
where $\mathcal{Y}_{i,miss}$ is the space of missing values for observation $i$ (the integral becomes a sum if discrete).
Maximum Likelihood Estimation with Missing Data. Missing data likelihood (MVN example):
$$Y \sim \mathcal{N}_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix} \right),$$
with $Y_1$ always observed and $R_2 = 1(Y_1 < 1)$. Then
$$L(\mu) = \prod_{i: R_{i2} = 0} \mathcal{N}(Y_i \mid \mu, \Sigma) \prod_{i: R_{i2} = 1} \mathcal{N}(Y_{i1} \mid \mu_1, \sigma_1^2).$$
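A sketch of this observed-data log-likelihood in R, assuming the mvtnorm package for the bivariate density (the function name obs.loglik is illustrative):

obs.loglik <- function(mu, Sigma, y1, y2) {
  miss <- is.na(y2)
  ## incomplete cases contribute the marginal of Y1
  ll.marg  <- sum(dnorm(y1[miss], mean = mu[1], sd = sqrt(Sigma[1, 1]), log = TRUE))
  ## complete cases contribute the bivariate density
  ll.joint <- sum(mvtnorm::dmvnorm(cbind(y1[!miss], y2[!miss]),
                                   mean = mu, sigma = Sigma, log = TRUE))
  ll.marg + ll.joint
}

Maximizing this function (e.g., with optim()) gives the missing data MLE.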
[Figures omitted: the missing data log-likelihood for the MVN example, and for the corresponding example with no correlation.]
Maximum Likelihood Estimation with Missing Data. How do we maximize the missing data log-likelihood? The EM algorithm is particularly useful here. How do we calculate standard errors from the missing data log-likelihood?
1. Asymptotics: $\hat\theta \stackrel{d}{\to} \mathcal{N}(\theta, I^{-1}(\hat\theta))$.
2. The bootstrap.
Maximum Likelihood Estimation with Missing Data. Big issues with the MLE approach:
1. Oftentimes the integral
$$L(\theta) = \prod_{i=1}^n \int_{\mathcal{Y}_{i,miss}} f_Y(y_{i,obs}, y_{i,miss} \mid \theta)\, dy_{i,miss}$$
is hard to compute.
2. Maximizing the complete data likelihood is computationally faster (and sometimes analytically tractable).
Solution: multiple imputation (aka using Bayesian techniques without actually being Bayesian).
Multiple Imputation. The three steps of multiple imputation: Imputation, Estimation, Pooling. The missing data are imputed M times to form Data Set 1, ..., Data Set M; each completed data set is analyzed to produce Estimate 1, ..., Estimate M; and the estimates are pooled into the final results.
Multiple Imputation. The Imputation Step (Algorithm):
1. Choose an initial value $\theta^{(0)}$.
2. For $m = 1, \ldots, M$:
   i. for all $i$, draw the missing values from the conditional distribution $Y^{(m)}_{i,miss} \sim f(y_{miss} \mid y_{obs}, \theta^{(m-1)})$;
   ii. set $\theta^{(m)} = \arg\max_\theta L(\theta \mid Y_{obs}, Y^{(m)}_{miss})$.
Multiple Imputation. The Imputation Step: an MVN example.
$$Y \sim \mathcal{N}_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix} \right),$$
with $Y_1$ always observed and $R_2 = 1(Y_1 < 1)$.
Multiple Imputation. The Imputation Step (MVN example, continued):
1. Set $\hat\mu^{(0)}$ and $\hat\Sigma^{(0)}$ to the complete case empirical mean and covariance matrix.
2. For $m = 1, \ldots, M$:
   i. for all $i$ with $Y_{i2}$ missing, draw from the conditional distribution
$$y_2 \sim \mathcal{N}\left( \mu_2 + \frac{\sigma_{21}}{\sigma_1^2}(y_1 - \mu_1),\; \sigma_2^2 - \frac{\sigma_{21}^2}{\sigma_1^2} \right);$$
   ii. set
$$\hat\mu^{(m)} = \frac{1}{n} \sum_{i=1}^n y_i, \qquad \hat\Sigma^{(m)} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat\mu^{(m)})(y_i - \hat\mu^{(m)})'.$$
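A sketch of this chain in R for the bivariate normal example (the update uses cov(), whose n - 1 denominator differs trivially from the MLE):

set.seed(8)
y1 <- rnorm(500)
y2 <- 0.9 * y1 + sqrt(1 - 0.9^2) * rnorm(500)
y2[y1 < 1] <- NA
miss <- is.na(y2)

cc    <- cbind(y1, y2)[!miss, ]          # 1. complete-case initialization
mu    <- colMeans(cc)
Sigma <- cov(cc)

M <- 50
mu.trace <- matrix(NA, M, 2)
for (m in 1:M) {
  ## i. draw the missing y2's from their conditional given y1
  cond.mean <- mu[2] + Sigma[2, 1] / Sigma[1, 1] * (y1[miss] - mu[1])
  cond.var  <- Sigma[2, 2] - Sigma[2, 1]^2 / Sigma[1, 1]
  y2[miss]  <- rnorm(sum(miss), cond.mean, sqrt(cond.var))
  ## ii. re-estimate the parameters from the completed data
  mu    <- colMeans(cbind(y1, y2))
  Sigma <- cov(cbind(y1, y2))
  mu.trace[m, ] <- mu
}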
Multiple Imputation The Imputation Step (Algorithm): Issues to Consider 1. The sequence of parameters and missing data imputations should converge.
Multiple Imputation. The Imputation Step, issues to consider, 2: How do we assess convergence?
- Trace plots
- Autocorrelation plots (Stat 651)
- Effective sample size (Stat 651)
- Convergence diagnostics (Stat 651)
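Continuing the bivariate normal sketch above (this assumes the mu.trace matrix produced there):

matplot(mu.trace, type = "l", lty = 1,
        xlab = "Iteration m", ylab = "Components of mu")   # trace plot
acf(mu.trace[, 2])                                         # autocorrelation plot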
Multiple Imputation. The Imputation Step, issues to consider, 3: What do we do if we can't draw from $Y^{(m)}_{i,miss} \sim f(y_{miss} \mid y_{obs}, \theta^{(m-1)})$ directly? Use the Metropolis-Hastings algorithm (take Stat 651 and you'll learn how).
Multiple Imputation. The Analysis Phase: calculate the MLEs, SEs, predictions, etc. (whatever you're interested in) for each imputed dataset.
Multiple Imputation. The Pooling Phase: pooling parameter estimates,
$$\bar\theta = \frac{1}{M} \sum_{m=1}^M \hat\theta_m.$$
Note: this pooled estimate is most appropriate under normality of the $\hat\theta_m$'s.
Multiple Imputation. The Pooling Phase: pooling standard errors,
$$V_w = \frac{1}{M} \sum_{m=1}^M SE^2(\hat\theta_m), \qquad V_b = \frac{1}{M-1} \sum_{m=1}^M (\hat\theta_m - \bar\theta)^2,$$
$$V_T = V_w + V_b + \frac{V_b}{M} \;\Rightarrow\; SE_{pool} = \sqrt{V_T}.$$
Multiple Imputation. The Pooling Phase: the fraction of missing information is
$$FMI = \frac{V_b + V_b / M}{V_T}.$$
Hypothesis tests and confidence intervals use
$$t = \frac{\bar\theta - \theta_0}{\sqrt{V_T}}, \qquad \nu = (M - 1)\, FMI^{-2}$$
degrees of freedom.
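A sketch of Rubin's pooling rules as an R function, taking the M point estimates and their standard errors (the name pool.rubin is illustrative):

pool.rubin <- function(est, se) {
  M    <- length(est)
  qbar <- mean(est)               # pooled estimate
  Vw   <- mean(se^2)              # within-imputation variance
  Vb   <- var(est)                # between-imputation variance
  Vt   <- Vw + Vb + Vb / M        # total variance
  fmi  <- (Vb + Vb / M) / Vt      # fraction of missing information
  df   <- (M - 1) / fmi^2         # degrees of freedom for t-based inference
  c(estimate = qbar, se = sqrt(Vt), fmi = fmi, df = df)
}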
Approaches for NMAR. Selection model approach: factor $f(Y, R) = f(R \mid Y)\, f(Y)$. Challenge: you must specify how the missingness indicator depends on the (partly unobserved) data, which requires strong prior understanding.
Approaches for NMAR. Pattern mixture approach: factor $f(Y, R) = f(Y \mid R)\, f(R)$. Challenge: you must specify how the model parameters differ across missingness patterns, which again requires strong prior understanding.
Expectations for Employee Analysis Expectations: 1. Carry out a regression using all the data (use missing data likelihood or multiple imputation). 2. Assume MVN for the whole observation vector.