Missing Data. Where did it go?

Buddy Day
5 years ago

1 Missing Data. Where did it go?

2 Learning Objectives
- High-level discussion of some techniques
- Identify the type of missingness
- Single vs. multiple imputation
- My favourite technique

3 Problem
- Uh-oh, data are missing
- Cells that should have numbers or factors instead have {NA, DQ, (blank), ., 999}
- Could be the response variable or a predictor
- If the response is missing but the missingness is informative, you may have censored data
  - See slides on Time to Event Analysis

4 Types of Missingness
- MCAR: Missing Completely At Random
  - a.k.a. uniform non-response
  - Missing values are independent of observed values AND unobserved measurements
  - Corruption of data file
  - Dropped test tube
  - Instrument failure (unrelated to measurement)
  - $P(r \mid y_o, y_m) = P(r)$

5 Types of Missingness
- MAR: Missing At Random
  - a.k.a. ignorable (likelihood methods still valid)
  - Missing values are independent of unobserved measurements, given observed data
  - Very difficult to determine this type
  - Income missing from a survey, but we have info on property tax paid. Given property tax band, income is missing at random.
  - $P(r \mid y_o, y_m) = P(r \mid y_o)$

6 Types of Missingness
- NMAR: Not Missing At Random
  - a.k.a. non-ignorable
  - Missing data depend on the missing values themselves
  - Income missing from a survey
  - You can't really tell when you have this type either (from the data alone), although you might be able to guess by thinking about it
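The three mechanisms can be compared with a small simulation. This is only a sketch under assumed numbers: the survey is hypothetical, `tax` stands for the always-observed property tax, `income` for the variable that goes missing, and the missingness probabilities (0.3, 0.6, 0.1) are invented for illustration. It reports the mean of the incomes that remain observed under each mechanism.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

n = 20000
# Hypothetical survey: property tax is always observed, income may be missing.
tax = [random.gauss(0.0, 1.0) for _ in range(n)]
income = [t + random.gauss(0.0, 1.0) for t in tax]  # income correlates with tax

def observed_mean(values, missing_prob):
    """Mean of the values that survive an independent missingness draw."""
    kept = [v for v, p in zip(values, missing_prob) if random.random() >= p]
    return statistics.fmean(kept)

# MCAR: every income has the same 30% chance of being missing.
mcar = observed_mean(income, [0.3] * n)
# MAR: the chance of missingness depends only on the observed tax value.
mar = observed_mean(income, [0.6 if t > 0 else 0.1 for t in tax])
# NMAR: the chance of missingness depends on the income value itself.
nmar = observed_mean(income, [0.6 if y > 0 else 0.1 for y in income])

true_mean = statistics.fmean(income)
print(f"true={true_mean:.3f} mcar={mcar:.3f} mar={mar:.3f} nmar={nmar:.3f}")
```

In this setup the observed-case mean stays close to the truth only under MCAR. Under MAR the bias could in principle be corrected using `tax`, because the mechanism depends on observed data only; under NMAR the data alone give no handle on it.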

7 What should we do?
- Quantify the missingness
  - Find out what percentage of each variable is missing
- Figure out the likely type of missingness
  - You'll have to put on your detective hat
- Decide how much time you want to spend (if any) on imputing the data
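Quantifying the missingness can be as simple as counting missing codes per variable. The sketch below assumes the data arrive as a list of dictionaries; the set `MISSING_CODES`, the function `percent_missing`, and the example rows are all hypothetical and would need adjusting to the codes a real dataset uses.

```python
# Hypothetical sentinel codes that stand for "missing" in this sketch.
MISSING_CODES = {None, "", ".", "NA", "DQ", 999}

def percent_missing(rows):
    """Return {column: percent of rows whose value is a missing code}."""
    columns = list(rows[0].keys())
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if row[col] in MISSING_CODES:
                counts[col] += 1
    return {col: 100.0 * counts[col] / len(rows) for col in columns}

# Made-up example rows.
rows = [
    {"age": 34, "income": "NA", "smoker": "yes"},
    {"age": 999, "income": 52000, "smoker": "."},
    {"age": 51, "income": "DQ", "smoker": "no"},
    {"age": 29, "income": 61000, "smoker": "no"},
]
report = percent_missing(rows)
print(report)
```

A numeric sentinel such as 999 deserves care: it should only be treated as missing for variables where it cannot be a legitimate value.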

8 Exercise
- Pair up
- Try to come up with an original example of each of the three types of missingness:
  - MCAR
  - MAR
  - NMAR
- Bonus if they are all of a similar theme
  - No actual bonus will be given

9 Imputation Paradigms
- Remove variables with high missing counts
  - Pro: Simple, cheap, instant
  - Con: May miss relevant predictors
- Ignore it (complete-case analysis)
  - Pro: Simple, cheap, instant
  - Con: Biased (unless MCAR), inefficient
- Single Imputation
  - Pro: Simple, fast
  - Con: Inflation of Type I errors, biased for NMAR
- Multiple Imputation
  - Pro: Unbiased, correct standard errors
  - Con: Complicated and time-consuming, may not converge for large datasets, more storage needed

10 Single Imputation
Impute the missing value with a single estimate
- Overall sample mean/median/min/max
  - Biased when data not MCAR
  - Destroys correlations
- LOCF (or LVCF): last observation (or value) carried forward
  - Not possible in cross-sectional (CS) studies, biased even under MCAR
- Hot Deck Imputation
  - Obsolete now with greater computing power
  - Difficult to accomplish with many demographics
- Predicted Mean
  - Inflation of Type I errors, correlations
  - Stochastic regression is better, but not perfect
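A tiny worked example shows why filling in the sample mean understates variability. The data and the helper `mean_impute` are made up for illustration; the point is only that the imputed values sit exactly at the mean, so the sample variance shrinks and standard errors computed afterwards come out too small.

```python
import statistics

def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

# Made-up variable with two missing entries.
x = [2.0, 4.0, None, 8.0, None, 6.0]
observed = [v for v in x if v is not None]
imputed = mean_impute(x)

var_observed = statistics.variance(observed)  # sample variance of the 4 observed values
var_imputed = statistics.variance(imputed)    # sample variance after filling in the mean
print(imputed, var_observed, var_imputed)
```

Here the sum of squared deviations is unchanged by imputation, but it is divided by 5 instead of 3, so the variance drops from about 6.67 to 4.0 even though no new information was added.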

11 K Nearest Neighbours
Impute the missing value with a single estimate using the K nearest cases from the dataset (mean or mode)
- Uses a lot of information when calculating distance
  - Could use the entire dataset if desired
- Can deal with other missing values
- Can be computationally (time) intensive, but not difficult to implement
- Shown to have good performance, and well-respected in the ML community

12 Multiple Imputation
Impute the missing value with several estimates
- proc MI: built-in SAS procedure
- Valid under MAR and NMAR assumptions (with some modification)
- Much more storage needed as one must store multiple datasets (~5-10)
- Tricky to obtain a Frequentist interpretation of results since it uses a Bayesian method
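The slide does not cover how results from the several imputed datasets are combined. A common approach is Rubin's rules, sketched below under the assumption that the same model has been fitted to each imputed dataset and produced one point estimate and one squared standard error per dataset. The function name `pool_rubin` and the numbers are hypothetical; in SAS this pooling step is handled by PROC MIANALYZE.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine per-dataset estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total

# Made-up regression coefficient from m = 5 imputed datasets.
estimates = [1.9, 2.1, 2.0, 2.2, 1.8]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.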

13 Alternatives to Imputation
- EM Algorithm
  - Allows ML estimates of regression parameters to be made, but does not actually impute values
  - Slow to converge, requires MAR assumption
- GEEs and GLMMs
  - Also do not give imputed values explicitly
  - Very complex likelihood function which is difficult to solve (i.e., MC simulation)
  - Require MAR assumption

14 Alternatives to Imputation
- IPW (Inverse Probability Weighting) method
  - Complicated
  - Requires MAR assumption
- Classifiers (RIPPER, Naive-Bayes, SVM)
- Classifiers (CN2, C4.5)
  - Require proper training data with missing values
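The core idea of inverse probability weighting fits in a few lines: each observed case is weighted by one over its probability of being observed, so under-represented cases count for more. The sketch below assumes those probabilities are known, which is the unrealistic part; in practice they would have to be estimated from observed covariates (for example with a logistic regression), and that estimation is where the complication lies. The function `ipw_mean` and the data are hypothetical.

```python
def ipw_mean(values, response_prob):
    """Weighted mean of the observed values, each weighted by 1 / P(observed)."""
    numerator = 0.0
    denominator = 0.0
    for y, p in zip(values, response_prob):
        if y is None:
            continue  # missing cases contribute nothing directly
        numerator += y / p
        denominator += 1.0 / p
    return numerator / denominator

# Made-up data: the second group responds only half of the time.
income = [30.0, 30.0, 30.0, 30.0, 90.0, None, 90.0, None]
prob = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]

observed = [y for y in income if y is not None]
naive = sum(observed) / len(observed)
weighted = ipw_mean(income, prob)
print(naive, weighted)
```

The naive mean of the six observed values is 50, while the weighted mean is 60, which is what the full sample would give if the two missing incomes resembled the other members of their group.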

15 My favourite Method (K-NN)
Impute the missing value with a single estimate using the K nearest cases from the dataset (mean or mode)
- Numerical covariates
  - One: Easy, just use the K closest cases (absolute distance) that don't have missing values
  - Many: Can use Euclidean distance instead, or any other p-norm for that matter (e.g., Manhattan)
    - See also Mahalanobis distance
    - Probably need to standardize first
    - Sometimes you DON'T want to do this
- What should K be?
  - Not too small or too large. Maybe 5 ≤ K ≤ 10; or choose by cross-validation
  - Doesn't seem to matter too much in practice
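A minimal sketch of K-NN imputation for numerical covariates follows. It assumes that only one column has missing values and that all other columns are complete; it standardizes the predictors and uses Euclidean distance. The names `standardize` and `knn_impute` and the example rows are hypothetical, and this is an illustration of the idea rather than the author's own implementation.

```python
import math
import statistics

def standardize(columns):
    """Z-score each column so that no single covariate dominates the distance."""
    scaled = []
    for col in columns:
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col)
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return scaled

def knn_impute(rows, target, k=5):
    """Fill None in column `target` with the mean of the K nearest complete rows."""
    predictors = [j for j in range(len(rows[0])) if j != target]
    z_cols = standardize([[r[j] for r in rows] for j in predictors])
    z_rows = list(zip(*z_cols))  # back to one tuple of z-scores per row
    donors = [i for i, r in enumerate(rows) if r[target] is not None]
    filled = [list(r) for r in rows]
    for i, r in enumerate(rows):
        if r[target] is None:
            nearest = sorted(donors, key=lambda d: math.dist(z_rows[i], z_rows[d]))[:k]
            filled[i][target] = statistics.fmean(rows[d][target] for d in nearest)
    return filled

# Made-up data: the last row is missing its third value.
rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 110.0],
    [3.0, 30.0, 120.0],
    [8.0, 80.0, 300.0],
    [9.0, 90.0, 310.0],
    [2.5, 25.0, None],
]
filled = knn_impute(rows, target=2, k=3)
print(filled[5])
```

The last row sits among the first three cases, so with K = 3 its imputed value is the mean of their third column. Swapping `math.dist` for a Manhattan or Mahalanobis distance only changes the sort key.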

16 K-NN
Impute the missing value with a single estimate using the K nearest cases from the dataset (mean or mode)
- Categorical predictors
  - Can assign a distance metric for each
    - 0 for same category, 1 for other
    - May want to modify this slightly for ordinal variables
  - May want to remove predictors that aren't useful before calculating distance (univariate LMs)
  - Can also use a weighting matrix if you want to give more importance to some predictors
    - Theoretically this could have a huge effect, but in practice it doesn't seem to matter much
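For categorical predictors, one way to realize the distance is simple matching, sketched below. The sketch uses the convention that a match contributes 0 and a mismatch contributes 1, so that smaller totals mean closer cases; ordinal variables get a scaled rank difference instead, and each predictor has a weight. The names `ORDINAL_LEVELS`, `cat_distance`, `knn_mode`, and the example rows are all hypothetical.

```python
from collections import Counter

# Hypothetical ordinal variable with its levels listed in order.
ORDINAL_LEVELS = {"education": ["primary", "secondary", "tertiary"]}

def cat_distance(a, b, weights):
    """Weighted sum of per-variable distances between two cases."""
    total = 0.0
    for name, w in weights.items():
        if name in ORDINAL_LEVELS:
            levels = ORDINAL_LEVELS[name]
            # Scaled rank difference: adjacent levels are closer than distant ones.
            d = abs(levels.index(a[name]) - levels.index(b[name])) / (len(levels) - 1)
        else:
            d = 0.0 if a[name] == b[name] else 1.0
        total += w * d
    return total

def knn_mode(rows, target, weights, k=3):
    """Fill None in `target` with the most common value among the K nearest donors."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        row = dict(row)
        if row[target] is None:
            nearest = sorted(donors, key=lambda d: cat_distance(row, d, weights))[:k]
            row[target] = Counter(d[target] for d in nearest).most_common(1)[0][0]
        filled.append(row)
    return filled

# Made-up data: the last case is missing its smoker status.
rows = [
    {"region": "north", "education": "primary", "smoker": "no"},
    {"region": "north", "education": "secondary", "smoker": "no"},
    {"region": "south", "education": "tertiary", "smoker": "yes"},
    {"region": "south", "education": "tertiary", "smoker": "yes"},
    {"region": "north", "education": "primary", "smoker": None},
]
weights = {"region": 1.0, "education": 1.0}
filled = knn_mode(rows, "smoker", weights, k=3)
print(filled[4])
```

Setting a weight to 0 has the same effect as removing that predictor before computing distances.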

17 K-NN
Impute the missing value with a single estimate using the K nearest cases from the dataset (mean or mode)
- Mixed predictors
  - Need some way to realize a distance metric
  - I usually just play with the weightings based on what I know about the data
  - I'm sure there's probably a better way to do it
- K-NN in general
  - Many modifications, tweaks, speedups, adaptations, robustness-enhancers, and generalizations exist
  - Literature search to find them!
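One established option for mixed predictors is a Gower-style distance, which is a swapped-in technique here and not something the slides prescribe. Each numeric variable contributes its absolute difference divided by the variable's range, each categorical variable contributes 0 for a match and 1 for a mismatch, and the contributions are combined as a weighted average. The function `mixed_distance`, the ranges, and the weights below are hypothetical.

```python
def mixed_distance(a, b, numeric_ranges, categorical, weights):
    """Gower-style weighted average of per-variable distances, each in [0, 1]."""
    numerator = 0.0
    denominator = 0.0
    for name, value_range in numeric_ranges.items():
        w = weights.get(name, 1.0)
        numerator += w * abs(a[name] - b[name]) / value_range
        denominator += w
    for name in categorical:
        w = weights.get(name, 1.0)
        numerator += w * (0.0 if a[name] == b[name] else 1.0)
        denominator += w
    return numerator / denominator

# Made-up cases; the ranges would normally be computed from the dataset.
a = {"age": 30, "income": 40000, "region": "north"}
b = {"age": 40, "income": 60000, "region": "south"}
numeric_ranges = {"age": 50, "income": 100000}
categorical = ["region"]
weights = {"region": 2.0}  # region is judged twice as important; others default to 1

d_ab = mixed_distance(a, b, numeric_ranges, categorical, weights)
d_aa = mixed_distance(a, a, numeric_ranges, categorical, weights)
print(d_ab, d_aa)
```

Because every component lies between 0 and 1, the weights are the only knob left to play with, which matches the spirit of tuning them from knowledge of the data.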

Missing Data Analysis for the Employee Dataset

67% of the observations have missing values!

Modeling Setup
Random Variables:
$Y_i = (Y_{i1}, \dots, Y_{ip})' = (Y_{i,\mathrm{obs}}, Y_{i,\mathrm{miss}})'$
$R_i = (R_{i1}, \dots, R_{ip})'$
( 1