Missing Data Analysis for the Employee Dataset

67% of the observations have missing values!

Modeling Setup. Random variables:
$$Y_i = (Y_{i1}, \ldots, Y_{ip})' = (Y_{i,\text{obs}}, Y_{i,\text{miss}})', \qquad R_i = (R_{i1}, \ldots, R_{ip})',$$
where
$$R_{ij} = \begin{cases} 1 & \text{if } Y_{ij} \text{ is missing} \\ 0 & \text{otherwise.} \end{cases}$$

Missing Data Patterns (each pattern was illustrated with a diagram over variables $Y_1, \ldots, Y_4$):
- Univariate: missingness confined to a single variable.
- Unit nonresponse: respondents refuse to answer some of the variables.
- Monotone (longitudinal): missingness due to dropout.
- General: missing values spread throughout the data.
- Latent variables: all values of a single variable are missing.

Missing Data Mechanisms (Rubin 1976). Let $\phi$ denote the parameters governing the missing data and $\theta$ the parameters of interest.
1. Missing Completely at Random (MCAR): $[R, Y \mid \phi, \theta] = [R \mid \phi]\,[Y \mid \theta]$
2. Missing at Random (MAR): $[R, Y \mid \phi, \theta] = [R \mid Y_{\text{obs}}, \phi]\,[Y \mid \theta]$
3. Not Missing at Random (NMAR or MNAR): $[R, Y \mid \phi, \theta] = [R \mid Y_{\text{obs}}, Y_{\text{miss}}, \phi]\,[Y \mid \theta]$

Missing Data Mechanisms (Rubin 1976). 1. Missing Completely at Random (MCAR): $Y = (Y_1, Y_2)$ with $Y_1$ always observed; $R = (R_1, R_2)$ with $R_2 \sim \mathrm{B}(1, 0.1)$.

Missing Data Mechanisms (Rubin 1976). 2. Missing at Random (MAR): $Y = (Y_1, Y_2)$ with $Y_1$ always observed; $R = (R_1, R_2)$ with $R_2 = 1(Y_1 < 1)$.

Missing Data Mechanisms (Rubin 1976). 3. Not Missing at Random (NMAR): $Y = (Y_1, Y_2)$ with $Y_1$ always observed; $R = (R_1, R_2)$ with $R_2 = 1(Y_2 < 1)$.
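To make the three mechanisms concrete, here is a minimal R sketch (purely hypothetical simulated data) generating a missingness indicator under each one:

set.seed(1)
n <- 1000
y1 <- rnorm(n)
y2 <- rnorm(n)
r2_mcar <- rbinom(n, 1, 0.1)   # MCAR: R2 ~ B(1, 0.1), unrelated to Y
r2_mar  <- as.numeric(y1 < 1)  # MAR: missingness depends only on the observed Y1
r2_nmar <- as.numeric(y2 < 1)  # NMAR: missingness depends on the missing Y2 itself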

Missing Data Mechanisms (Rubin 1976). Why do we need to understand the missing data mechanism? If the data are NMAR, then the missing data indicators, marginally, contain information about the parameters we are interested in. (Diagram: integrating the missing observations out of the graph over $Y_{\text{miss}}$, $Y_{\text{obs}}$, and $R$ leaves $R$ directly dependent on $Y_{\text{obs}}$.) Take-home message: if data are NMAR, we have to model the missing data indicators.

Missing Data Mechanisms (Rubin 1976). Why do we need to understand the missing data mechanism? On the other hand, if data are MAR (or MCAR), then the missing data indicators carry no information about the parameters of interest. (Diagram: after integrating out the missing observations, $R$ adds nothing about $\theta$ beyond $Y_{\text{obs}}$.) Take-home message: if data are MAR, we don't have to model the missing data indicators, but we do need to include the incomplete observations (because of the correlation).

Missing Data Mechanisms (Rubin 1976). How can we tell which missing data mechanism is present? There is no way to test for NMAR (the data needed to check are, by definition, missing). We can, however, distinguish between MCAR and MAR (both checks are sketched below):
o Fit a logistic regression of the missing data indicator on the observed data (if MCAR, nothing will be significant).
o Compare the distribution (via a Kolmogorov-Smirnov test or simple t-tests) of the observed data when R = 1 vs. R = 0.
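Both diagnostics are easy to run; a minimal R sketch, assuming a hypothetical data frame dat with a fully observed column y1 and an incomplete column y2:

r2 <- as.numeric(is.na(dat$y2))  # missing data indicator for y2

# Logistic regression of the indicator on observed data;
# under MCAR, no observed covariate should be significant.
summary(glm(r2 ~ y1, data = dat, family = binomial))

# Compare the distribution of y1 when y2 is missing vs. observed.
t.test(dat$y1[r2 == 1], dat$y1[r2 == 0])
ks.test(dat$y1[r2 == 1], dat$y1[r2 == 0])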

Traditional Missing Data Methods. Listwise deletion: use only the complete cases. Advantages: 1. Convenient. 2. OK if data are MCAR. Disadvantages: 1. Biased results otherwise. 2. Throws away much of the data.

Traditional Missing Data Methods. Listwise deletion wastes a lot of data. Let $N$ = number of observations, $P$ = number of covariates, and $\pi$ = probability that the $p$th covariate is missing. Assume $R_{ip} \overset{iid}{\sim} \mathrm{B}(1, \pi)$. Then
$$(\text{case } i \text{ complete}) \sim \mathrm{B}(1, (1-\pi)^P), \qquad \#\text{ complete cases} \sim \mathrm{B}(N, (1-\pi)^P),$$
so $E(\#\text{ complete cases}) = N(1-\pi)^P$.

Traditional Missing Data Methods. (Plot: with $\pi = 0.02$ and $N = 100$, the expected number of complete cases, $N(1-\pi)^P$, falls steadily while the expected number of observations thrown out rises as $P$ grows from 0 to 50.)
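That curve is simple to reproduce; a sketch under the slide's settings ($\pi = 0.02$, $N = 100$):

N <- 100
p_miss <- 0.02               # probability any one covariate is missing
P <- 0:50
e_cc <- N * (1 - p_miss)^P   # E(# complete cases) = N(1 - pi)^P
plot(P, e_cc, type = "l", ylim = c(0, N), xlab = "P", ylab = "Number of Obs")
lines(P, N - e_cc, lty = 2)  # expected number of observations thrown out
legend("right", legend = c("E(# of CC)", "E(# Thrown Out)"), lty = 1:2)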

Traditional Missing Data Methods. Listwise deletion: use only the complete data. Estimates of the two means under each mechanism (one accompanying plot per mechanism):

Mechanism   $\hat\mu_1$   $\hat\mu_2$
MCAR        0.00          0.03
MAR         0.35          0.38
NMAR        0.33          0.41

Traditional Missing Data Methods. Mean imputation: replace each missing value with the mean (or mode) of that particular variable. Advantages: 1. Convenient. Disadvantages: 1. Reduces the variability of the data. 2. Reduces correlations.

Traditional Missing Data Methods. Mean imputation example: $Y = (Y_1, Y_2)$ with $Y_1$ always observed; $R = (R_1, R_2)$ with $R_2 = 1(Y_1 < 1)$.
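A one-line sketch of mean imputation for this setup, assuming a hypothetical numeric vector y2 with NAs:

# Every missing entry receives the same value: the observed mean.
y2_imputed <- ifelse(is.na(y2), mean(y2, na.rm = TRUE), y2)

Because every imputed case receives the identical value, the variance of y2 (and its correlation with other variables) shrinks, which is exactly the disadvantage listed above.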

Traditional Missing Data Methods. Regression imputation: use the complete cases to fit a regression, then replace missing values with predicted values. Advantages: 1. Convenient. 2. Uses the observed data to fill in the missing data. Disadvantages: 1. Increases correlations. 2. Biases variance estimates.

Traditional Missing Data Methods. Regression imputation example: $Y = (Y_1, Y_2)$ with $Y_1$ always observed; $R = (R_1, R_2)$ with $R_2 = 1(Y_1 < 1)$.
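A minimal sketch of regression imputation, assuming a hypothetical data frame dat with predictor y1 and incomplete response y2:

fit <- lm(y2 ~ y1, data = dat)   # lm() drops incomplete cases by default
miss <- is.na(dat$y2)
dat$y2[miss] <- predict(fit, newdata = dat[miss, ])  # fill in fitted values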

Traditional Missing Data Methods. Stochastic regression imputation: use the complete cases to fit a regression, then replace each missing value with a draw from the predictive distribution. Advantages: 1. Convenient. 2. Uses the observed data to fill in the missing data. 3. Produces unbiased estimates of parameters if MAR. Disadvantages: 1. Understates standard errors.

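A sketch of the stochastic variant under the same hypothetical setup; the only change from regression imputation is adding a draw from the estimated residual distribution to each prediction:

fit <- lm(y2 ~ y1, data = dat)
miss <- is.na(dat$y2)
pred <- predict(fit, newdata = dat[miss, ])
# Add residual noise so imputed values are draws from the predictive distribution.
dat$y2[miss] <- pred + rnorm(sum(miss), mean = 0, sd = summary(fit)$sigma)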

Traditional Missing Data Methods. Hot deck imputation: find the K nearest neighbors of an incomplete case, then replace its missing values with the mean (or mode) of those neighbors. Advantages: 1. Convenient. 2. Maintains univariate distributions. Disadvantages: 1. Overestimates correlations (particularly when K = 1). 2. Slightly understates standard errors.

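A minimal K-nearest-neighbor hot deck sketch for the same hypothetical dat, matching donors on the observed y1:

K <- 5
miss_idx <- which(is.na(dat$y2))
obs_idx  <- which(!is.na(dat$y2))
for (i in miss_idx) {
  d  <- abs(dat$y1[obs_idx] - dat$y1[i])  # distance in the observed covariate
  nn <- obs_idx[order(d)[1:K]]            # the K nearest complete cases (donors)
  dat$y2[i] <- mean(dat$y2[nn])           # impute the donors' mean
}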

Modeling Missing Data. Key idea: handling missing data requires a multivariate model for $Y_i = (Y_{i1}, \ldots, Y_{ip})' = (Y_{i,\text{obs}}, Y_{i,\text{miss}})'$ rather than a model for a univariate response alone. A common (and extremely useful) multivariate tool is the multivariate normal (MVN) distribution.

Review of the MVN Distribution. Let $Y = (y_1, \ldots, y_P)'$. If $Y$ follows a multivariate normal (Gaussian) distribution, $Y \sim \mathcal{N}_P(\mu, \Sigma_Y)$, then
$$f_Y(y) = \frac{1}{(2\pi)^{P/2}\,|\Sigma_Y|^{1/2}} \exp\left\{ -\frac{1}{2}(y - \mu)'\,\Sigma_Y^{-1}\,(y - \mu) \right\},$$
where $\mu = (\mu_1, \ldots, \mu_P)'$ is the mean vector and $\Sigma_Y$ is the covariance matrix.

Review of the MVN Distribution. Partition
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma_Y = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_2 \end{pmatrix}.$$
The marginal distribution of $Y_1$ is $Y_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$. The conditional distribution of $Y_1 \mid Y_2$ is $\mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$, where
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_2^{-1}(Y_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_1 - \Sigma_{12}\Sigma_2^{-1}\Sigma_{12}'.$$

Review of the MVN Distribution. How to draw from $\mathcal{N}(\mu, \Sigma)$:
1. Calculate the Cholesky decomposition $\Sigma = LL'$.
2. Draw $Z \sim \mathcal{N}(0, I)$.
3. Set $Y = \mu + LZ$.
In R: mvn.draw <- mu + t(chol(sigma)) %*% rnorm(p). Exercise: can you show that $E(Y) = \mu$ and $V(Y) = \Sigma$?
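Expanding that one-liner into a runnable sketch (hypothetical choices of mu and sigma):

p <- 2
mu <- c(0, 0)
sigma <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
L <- t(chol(sigma))              # chol() returns upper-triangular U with sigma = U'U
mvn.draw <- mu + L %*% rnorm(p)  # Y = mu + L Z with Z ~ N(0, I)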

Regression with the MVN. Partition
$$Y_i = \begin{pmatrix} y_i \\ X_i \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_y \\ \mu_X \end{pmatrix}, \qquad \Sigma_Y = \begin{pmatrix} \sigma^2_y & \Sigma_{yX} \\ \Sigma_{yX}' & \Sigma_X \end{pmatrix}.$$
The conditional distribution of $y_i \mid X_i$ is $\mathcal{N}(\mu_{y|X}, \sigma^2_{y|X})$, where
$$\mu_{y|X} = \mu_y + \Sigma_{yX}\Sigma_X^{-1}(X_i - \mu_X) = \underbrace{\mu_y - \Sigma_{yX}\Sigma_X^{-1}\mu_X}_{\beta_0} + \underbrace{\Sigma_{yX}\Sigma_X^{-1}}_{\beta_1'}\,X_i = \tilde{X}_i'\beta.$$

Assessing MVN. How do we know whether data arise from a multivariate normal distribution? 1. Univariate histograms (or density estimates). 2. Bivariate density estimates. 3. A chi-square Q-Q plot: under multivariate normality, $(Y_i - \mu)'\,\Sigma^{-1}\,(Y_i - \mu) \sim \chi^2_p$.
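A sketch of the chi-square Q-Q plot, assuming a hypothetical numeric data matrix Y (rows = observations):

mu <- colMeans(Y)
S  <- cov(Y)
d2 <- mahalanobis(Y, center = mu, cov = S)  # (Y_i - mu)' S^{-1} (Y_i - mu)
qqplot(qchisq(ppoints(nrow(Y)), df = ncol(Y)), d2,
       xlab = "Chi-square quantiles", ylab = "Squared Mahalanobis distance")
abline(0, 1)  # points near this line support multivariate normality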

Regression with the MVN. Key points: 1. If $(y_i, X_i)'$ is MVN, then the regression coefficients of $y_i$ on $X_i$ come directly from the mean vector and covariance matrix. 2. It is easy to get any conditional distribution (including the distribution of $X$ given $y$) via properties of the MVN. But what are the MLEs of $\mu$ and $\Sigma$?
$$\hat\mu = \frac{1}{N}\sum_i Y_i, \qquad \hat\Sigma = \frac{1}{N}\sum_i (Y_i - \hat\mu)(Y_i - \hat\mu)'.$$

Maximum Likelihood Estimation with Missing Data. The missing data likelihood, where $f_Y(y \mid \theta)$ is the joint distribution of ALL the data:
$$L(\theta) = \prod_{i=1}^{n} \underbrace{\int_{\mathcal{Y}_{i,\text{miss}}} f_Y(y_{i,\text{obs}}, y_{i,\text{miss}} \mid \theta)\, dy_{i,\text{miss}}}_{\text{marginal dist. of observed data}},$$
where $\mathcal{Y}_{i,\text{miss}}$ is the space of missing values for observation $i$ (the integral becomes a sum if the missing values are discrete).

Maximum Likelihood Estimation with Missing Data. Missing data likelihood (MVN example). Let
$$Y \sim \mathcal{N}_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix} \right),$$
with $Y = (Y_1, Y_2)$, $Y_1$ always observed, and $R = (R_1, R_2)$ with $R_2 = 1(Y_1 < 1)$. Then
$$L(\mu) = \prod_{i: R_{i2} = 0} \mathcal{N}(Y_i \mid \mu, \Sigma) \prod_{i: R_{i2} = 1} \mathcal{N}(Y_{i1} \mid \mu_1, \sigma^2_1).$$
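A sketch of this observed-data log-likelihood as an R function, treating $\Sigma$ as known; it assumes the mvtnorm package and hypothetical vectors y1, y2 (with NAs in y2):

library(mvtnorm)
obs_loglik <- function(mu, y1, y2, Sigma) {
  miss <- is.na(y2)
  # Complete cases contribute the full bivariate density...
  ll_complete <- sum(dmvnorm(cbind(y1, y2)[!miss, , drop = FALSE],
                             mean = mu, sigma = Sigma, log = TRUE))
  # ...incomplete cases contribute only the marginal density of y1.
  ll_marginal <- sum(dnorm(y1[miss], mean = mu[1],
                           sd = sqrt(Sigma[1, 1]), log = TRUE))
  ll_complete + ll_marginal
}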


Maximum Likelihood Estimation with Missing Data. Missing data likelihood (MVN example with no correlation; shown graphically on the slide).

Maximum Likelihood Estimation with Missing Data. How do we maximize the missing data likelihood? The EM algorithm is particularly useful here. How do we calculate standard errors from the missing data likelihood? 1. Asymptotics: $\hat\theta \overset{d}{\to} \mathcal{N}(\theta, I^{-1}(\hat\theta))$. 2. The bootstrap.

Maximum Likelihood Estimation with Missing Data. Big issues with the MLE approach: 1. Oftentimes the integral
$$L(\theta) = \prod_{i=1}^{n} \int_{\mathcal{Y}_{i,\text{miss}}} f_Y(y_{i,\text{obs}}, y_{i,\text{miss}} \mid \theta)\, dy_{i,\text{miss}}$$
is hard to compute. 2. Maximizing the complete data likelihood is computationally faster (and sometimes analytically tractable). Solution: multiple imputation (aka using Bayesian techniques without actually being Bayesian).

Multiple Imputation. The three steps of multiple imputation: imputation, estimation, and pooling. (Diagram: the incomplete data set is imputed $M$ times to produce data sets $1, \ldots, M$; each data set is analyzed to produce estimates $1, \ldots, M$; and the estimates are pooled into the final results.)

Multiple Imputation. The imputation step (algorithm):
1. Choose an initial value $\theta_0$.
2. For $m = 1, \ldots, M$:
   i. for all $i$, draw the missing values from the conditional distribution $Y^{(m)}_{i,\text{miss}} \sim f(y_{\text{miss}} \mid y_{\text{obs}}, \theta_{m-1})$;
   ii. set $\theta_m = \arg\max_\theta L(\theta \mid Y_{\text{obs}}, Y^{(m)}_{\text{miss}})$.

Multiple Imputation. The imputation step (algorithm): an MVN example. Let
$$Y \sim \mathcal{N}_2\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix} \right),$$
with $Y = (Y_1, Y_2)$, $Y_1$ always observed, and $R = (R_1, R_2)$ with $R_2 = 1(Y_1 < 1)$.

Multiple Imputation. The imputation step (algorithm): the MVN example, continued (see the sketch below).
1. Set $\hat\mu_0$ and $\hat\Sigma_0$ to the complete-case empirical mean and covariance matrix.
2. For $m = 1, \ldots, M$:
   i. for all $i$ with $Y_{i2}$ missing, draw from the conditional distribution
   $$y_2 \sim \mathcal{N}\left( \mu_2 + \frac{\sigma_{21}}{\sigma_1^2}(y_1 - \mu_1),\ \sigma_2^2 - \frac{\sigma_{21}^2}{\sigma_1^2} \right);$$
   ii. set
   $$\hat\mu_m = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \hat\Sigma_m = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat\mu_m)(y_i - \hat\mu_m)'.$$
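A sketch of the whole algorithm on simulated data matching the example (all names hypothetical):

set.seed(1)
n <- 500
y1 <- rnorm(n)
y2 <- 0.9 * y1 + sqrt(1 - 0.9^2) * rnorm(n)
y2[y1 < 1] <- NA                     # impose the MAR mechanism R2 = 1(Y1 < 1)

M <- 20
miss <- is.na(y2)
cc <- cbind(y1, y2)[!miss, ]
mu <- colMeans(cc); S <- cov(cc)     # step 1: complete-case starting values
mu_draws <- matrix(NA, M, 2)
for (m in 1:M) {
  # step 2i: draw the missing y2's from their conditional distribution given y1
  cond_mean <- mu[2] + S[1, 2] / S[1, 1] * (y1[miss] - mu[1])
  cond_var  <- S[2, 2] - S[1, 2]^2 / S[1, 1]
  y2[miss]  <- rnorm(sum(miss), cond_mean, sqrt(cond_var))
  # step 2ii: re-estimate (mu, Sigma) from the completed data
  Y  <- cbind(y1, y2)
  mu <- colMeans(Y); S <- cov(Y)
  mu_draws[m, ] <- mu
}
colMeans(mu_draws)  # estimates of (mu1, mu2); close to the true (0, 0) despite MAR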

Multiple Imputation The Imputation Step (Algorithm): Issues to Consider 1. The sequence of parameters and missing data imputations should converge.

Multiple Imputation. The imputation step (algorithm): issues to consider. 2. How do we assess convergence? Trace plots; autocorrelation plots (Stat 651); effective sample size (Stat 651); convergence diagnostics (Stat 651).

Multiple Imputation. The imputation step (algorithm): issues to consider. 3. What do we do if we can't draw from $Y^{(m)}_{i,\text{miss}} \sim f(y_{\text{miss}} \mid y_{\text{obs}}, \theta_{m-1})$? Use the Metropolis-Hastings algorithm (take Stat 651 and you'll learn how).

Multiple Imputation. The analysis phase: calculate the MLEs, SEs, predictions, etc. (whatever you're interested in) for each imputed dataset.

Multiple Imputation. The pooling phase: pooling parameter estimates,
$$\bar\theta = \frac{1}{M}\sum_{m=1}^{M} \hat\theta_m.$$
Note: this pooled estimate is most appropriate under normality of the $\hat\theta_m$'s.

Multiple Imputation. The pooling phase: pooling standard errors.
$$V_w = \frac{1}{M}\sum_{m=1}^{M} SE^2(\hat\theta_m), \qquad V_b = \frac{1}{M-1}\sum_{m=1}^{M} (\hat\theta_m - \bar\theta)^2,$$
$$V_T = V_w + V_b + \frac{V_b}{M} \;\Rightarrow\; SE_{\text{pool}} = \sqrt{V_T}.$$

Multiple Imputation. The pooling phase: the fraction of missing information is
$$FMI = \frac{V_b + V_b/M}{V_T}.$$
Hypothesis tests and confidence intervals use
$$t = \frac{\bar\theta - \theta_0}{\sqrt{V_T}} \sim t_\nu, \qquad \nu = (M-1)\,FMI^{-2}.$$
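These rules wrap naturally into a small helper; a sketch taking hypothetical vectors theta_hat and se_hat of per-imputation estimates and standard errors:

pool_mi <- function(theta_hat, se_hat) {
  M         <- length(theta_hat)
  theta_bar <- mean(theta_hat)     # pooled estimate
  Vw        <- mean(se_hat^2)      # within-imputation variance
  Vb        <- var(theta_hat)      # between-imputation variance
  VT        <- Vw + Vb + Vb / M    # total variance
  fmi       <- (Vb + Vb / M) / VT  # fraction of missing information
  df        <- (M - 1) / fmi^2     # degrees of freedom for t-based inference
  c(estimate = theta_bar, se = sqrt(VT), fmi = fmi, df = df)
}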

Approaches for NMAR. Selection model approach: factor the joint distribution as $f(Y, R \mid \theta, \phi) = f(R \mid Y, \phi)\,f(Y \mid \theta)$. Challenge: we need to relate the missing data to the missingness indicator, so we must have a strong prior understanding of the missingness process.

Approaches for NMAR. Pattern mixture approach: factor the joint distribution as $f(Y, R \mid \theta, \phi) = f(Y \mid R, \theta)\,f(R \mid \phi)$. Challenge: we need to relate the model parameters to the missingness indicator, so we must have a strong prior understanding of the missingness process.

Expectations for the Employee Analysis. Expectations: 1. Carry out a regression using all the data (use the missing data likelihood or multiple imputation). 2. Assume the MVN for the whole observation vector.