Missing Data Analysis for the Employee Dataset


67% of the observations have missing values!

Modeling Setup
For our analysis goals we would like to fit

Y | X ~ N(Xβ, σ²I)

and then interpret the coefficients to see how happiness relates to job performance. But we can't use this model directly, because sometimes we are NOT given the x's. Solutions:
1. Throw out the incomplete observations.
2. Fill in the missing X's and treat them as truth.
3. Iteratively fill in the missing X's as we learn about the relationship between Y and X.

Solution #1: Listwise Deletion: Use only the complete data.
Advantages:
1. Convenient.
Disadvantages:
1. Biases results if there is a reason the values were missing in the first place (e.g., poor performers don't report happiness).
2. Throws away much of the data.

Solution #1: Listwise Deletion wastes a lot of data. Let
N = number of observations,
P = number of covariates,
π = probability that the p-th covariate is missing.
Assume each missing-covariate indicator is iid Bernoulli(π). Then
1{case i is complete} ~ Bernoulli((1 − π)^P),
(# of complete cases) ~ Binomial(N, (1 − π)^P),
E(# of complete cases) = N(1 − π)^P.

Solution #1: [Figure: E(# of complete cases) and E(# thrown out) versus the number of covariates P, for π = 0.02 and N = 100; the expected number of complete cases falls steadily as P grows.]
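To see how quickly listwise deletion wastes data, the formula E(# of complete cases) = N(1 − π)^P is easy to evaluate directly. A minimal Python sketch (π = 0.02 and N = 100 are taken from the slide; the helper name is mine):

```python
def expected_complete_cases(N, P, pi):
    """Expected number of complete cases when each of P covariates
    is missing independently with probability pi."""
    return N * (1 - pi) ** P

# Even a tiny 2% per-covariate missingness rate adds up quickly:
for P in (1, 10, 25, 50):
    print(P, round(expected_complete_cases(100, P, 0.02), 1))
# With P = 50 covariates, fewer than 40 of the 100 cases are complete.
```

The geometric decay in P is the whole story: each additional covariate multiplies the complete-case count by (1 − π).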

Solution #2: Fill in the missing values. How do you want to fill in the missing values? Definition: to fill in = to impute.
Mean Imputation: Replace each missing value with the mean (or mode) of that particular variable.
Advantages:
1. Convenient.
Disadvantages:
1. Reduces the variability of the data.
2. Reduces correlations.

Solution #2: Fill in the missing values. Mean Imputation example. (X, Y): Y always observed, X missing if X < −1. [Figure: scatterplots of the complete dataset (missing X's highlighted) and the imputed dataset, where every imputed value sits at the mean of the observed X's.]
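The "reduces variability and correlations" point is easy to demonstrate with simulated data. A minimal Python sketch (the numbers are illustrative, not the course dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate correlated (X, Y), then delete X whenever X < -1,
# mirroring the slide's example.
n = 2000
x = rng.normal(size=n)
y = 0.9 * x + rng.normal(scale=0.5, size=n)
x_obs = np.where(x < -1, np.nan, x)

# Mean imputation: replace every missing X with the observed mean.
x_imp = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# The imputed X's have less spread and a weaker link with Y:
print("var :", x.var(), "->", x_imp.var())
print("corr:", np.corrcoef(x, y)[0, 1], "->", np.corrcoef(x_imp, y)[0, 1])
```

Because the deleted values were all in the left tail, replacing them with a single central value both shrinks the variance and flattens the X-Y relationship.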

Solution #2: Fill in the missing values. Regression Imputation: Use the complete cases to fit a regression, then replace missing values with predicted values.
Advantages:
1. Convenient.
2. Uses the observed data to fill in the missing data.
Disadvantages:
1. Artificially increases correlations.
2. Biases variance estimates.

Solution #2: Fill in the missing values. Regression Imputation example. (X, Y): Y always observed, X missing if X < −1. [Figure: scatterplots of the complete dataset (missing X's highlighted) and the imputed dataset, where the imputed values fall exactly on the fitted regression line.]
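A minimal sketch of regression imputation on simulated data (illustrative, not the course dataset); since X is the variable with missing values, the complete-case regression here is of X on Y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated (X, Y) with X missing whenever X < -1, as in the slide.
n = 2000
x = rng.normal(size=n)
y = 0.9 * x + rng.normal(scale=0.5, size=n)
miss = x < -1

# Fit X ~ Y on the complete cases, then impute with fitted values.
b1, b0 = np.polyfit(y[~miss], x[~miss], 1)
x_imp = np.where(miss, b0 + b1 * y, x)

# Every imputed value sits exactly on the fitted line, which is why
# regression imputation artificially strengthens correlations.
resid = x_imp[miss] - (b0 + b1 * y[miss])
print("max imputed residual:", np.abs(resid).max())
print("corr:", np.corrcoef(x_imp, y)[0, 1])
```

The zero residuals for the imputed rows are the mechanism behind both listed disadvantages: they tighten the apparent X-Y relationship and understate the spread of X.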

Solution #2: Fill in the missing values. Stochastic Regression Imputation: Use the complete cases to fit a regression, then replace each missing value with a draw from the prediction distribution.
Advantages:
1. Convenient.
2. Uses the observed data to fill in the missing data.
3. Can produce unbiased estimates.
Disadvantages:
1. Decreases standard errors (the imputed values are treated as if they were observed).

Solution #2: Fill in the missing values. Stochastic Regression Imputation example. [Figure: scatterplots of the complete dataset (missing X's highlighted) and the imputed dataset, where the imputed values scatter around the fitted regression line.]
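Stochastic regression imputation adds a residual draw on top of the fitted value, restoring the variability that deterministic regression imputation removes. A minimal Python sketch (illustrative simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated (X, Y) with X missing whenever X < -1, as in the slide.
n = 2000
x = rng.normal(size=n)
y = 0.9 * x + rng.normal(scale=0.5, size=n)
miss = x < -1

# Fit X ~ Y on the complete cases and estimate the residual spread.
b1, b0 = np.polyfit(y[~miss], x[~miss], 1)
resid_sd = np.std(x[~miss] - (b0 + b1 * y[~miss]))

# Deterministic imputation puts points on the line; stochastic
# imputation adds noise drawn from the prediction distribution.
x_det = np.where(miss, b0 + b1 * y, x)
noise = rng.normal(scale=resid_sd, size=n)
x_sto = np.where(miss, b0 + b1 * y + noise, x)

print("var deterministic:", x_det.var(), " var stochastic:", x_sto.var())
```

The stochastic version preserves spread (hence the "can produce unbiased estimates" advantage), but single imputation still treats the draws as data, so standard errors remain too small; that gap is what multiple imputation fixes.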

Rethinking the Analysis
What do we need to accomplish the goals of this analysis? We need a method that can simultaneously do the following:
1. Give us the effect of happiness on job performance.
2. Help us fill in the missing values so we don't have to throw away any data.
Rather than using our regression hammer on something that is not a nail, let's learn how to use a new tool: the multivariate normal distribution.

Review of the MVN Distribution
Let Y = (y_1, ..., y_P)'. If Y follows a multivariate normal (Gaussian) distribution, then Y ~ N_P(µ, Σ) with density

f_Y(y) = (2π)^{-P/2} |Σ|^{-1/2} exp( -(1/2)(y − µ)' Σ^{-1} (y − µ) ),

where µ = (µ_1, ..., µ_P)' is the mean vector and Σ is the covariance matrix. Awesome property: ANY marginal or conditional distribution is normal!

Regression using the MVN Distribution
Assume Y_i iid N(µ, Σ) and partition

Y_i = (y_i, x_i')',  µ = (µ_y, µ_x')',  Σ = [ σ_y²  Σ_yx ; Σ_xy  Σ_x ].

The conditional distribution of y_i | x_i is y_i | x_i ~ N(µ_{y|x}, σ²_{y|x}), where

µ_{y|x} = µ_y + Σ_yx Σ_x^{-1} (x_i − µ_x),
σ²_{y|x} = σ_y² − Σ_yx Σ_x^{-1} Σ_xy,
R² = Σ_yx Σ_x^{-1} Σ_xy / σ_y².

Regression using the MVN Distribution
The conditional distribution of y | x is N(µ_{y|x}, σ²_{y|x}), where

µ_{y|x} = µ_y + Σ_yx Σ_x^{-1} (x − µ_x),
σ²_{y|x} = σ_y² − Σ_yx Σ_x^{-1} Σ_xy.

So what? Let (β_1, ..., β_P) = Σ_yx Σ_x^{-1}, β_0 = µ_y − Σ_yx Σ_x^{-1} µ_x, and σ² = σ_y² − Σ_yx Σ_x^{-1} Σ_xy. Then

y | X = x ~ N(β_0 + x'β, σ²),

which is the linear regression we wanted in the first place!
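The translation from (µ, Σ) to (β_0, β, σ²) is just matrix arithmetic. A sketch with hypothetical numbers for (y, x_1, x_2), y listed first:

```python
import numpy as np

# Hypothetical MVN parameters for (y, x1, x2).
mu = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.0, 0.3],
                  [0.5, 0.3, 1.5]])

mu_y, mu_x = mu[0], mu[1:]
Sigma_yx = Sigma[0, 1:]   # cross-covariances of y with x
Sigma_x = Sigma[1:, 1:]   # covariance block of x

beta = np.linalg.solve(Sigma_x, Sigma_yx)        # Sigma_x^{-1} Sigma_xy
beta0 = mu_y - Sigma_yx @ np.linalg.solve(Sigma_x, mu_x)
sigma2 = Sigma[0, 0] - Sigma_yx @ beta           # residual variance

# Sanity check: the regression line passes through the means,
# so plugging x = mu_x into beta0 + x'beta recovers mu_y.
print(beta0 + mu_x @ beta)
```

Using `np.linalg.solve` instead of explicitly inverting Σ_x is the standard numerically stable way to compute Σ_yx Σ_x^{-1}.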

Regression using the MVN Distribution
A more general result (pretend Y_1 is your missing data). Partition

Y = (Y_1', Y_2')',  µ = (µ_1', µ_2')',  Σ = [ Σ_1  Σ_12 ; Σ_21  Σ_2 ].

The conditional distribution of Y_1 | Y_2 is Y_1 | Y_2 ~ N(µ_{1·2}, Σ_{1·2}), where

µ_{1·2} = µ_1 + Σ_12 Σ_2^{-1} (Y_2 − µ_2),
Σ_{1·2} = Σ_1 − Σ_12 Σ_2^{-1} Σ_21.

Estimation for the MVN Distribution
One can prove (any multivariate statistics textbook has the proofs) that if Y_i iid N(µ, Σ), i = 1, ..., n, then the unbiased estimates are

µ̂ = (1/n) ∑_{i=1}^n Y_i,
Σ̂ = (1/(n−1)) ∑_{i=1}^n (Y_i − µ̂)(Y_i − µ̂)'.

In R: apply(dataset, 2, mean) and cov(dataset).
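The same estimates in Python rather than R (numpy equivalents of the apply and cov calls; the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate iid MVN draws with known parameters.
mu_true = np.array([0.0, 1.0])
Sigma_true = np.array([[1.0, 0.5],
                       [0.5, 2.0]])
data = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

mu_hat = data.mean(axis=0)               # apply(dataset, 2, mean) in R
Sigma_hat = np.cov(data, rowvar=False)   # cov(dataset) in R, 1/(n-1) scaling

print(mu_hat)
print(Sigma_hat)
```

Note that `np.cov` defaults to the unbiased 1/(n−1) denominator, matching R's `cov`.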

MVN Regression Model
Multivariate normal model for regression:

(y_i, x_i')' iid N(µ, Σ),

so µ and Σ are the unknown parameters. Why is this useful for the employee dataset?
1. We can still get the regression coefficients we are interested in by looking at the distribution of y | x (see the previous slides for formulas).
2. We can fill in missing values by drawing from the distribution of missing | observed.

Solution #3: Multiple Imputation
The three steps of multiple imputation: Imputation, Estimation, Pooling.

Incomplete Data → Data Set 1, ..., Data Set M (imputation)
→ estimate parameters on each imputed dataset (estimation)
→ combine into Final Results (pooling)

Solution #3: Multiple Imputation
The multiple imputation algorithm for the MVN regression model:
1. Choose an initial value for µ and Σ (just use the complete data initially).
2. For m = 1, ..., M:
   i. Create a new complete dataset by filling in any missing values with draws from the conditional distribution (see the previous formulas).
   ii. Re-estimate the parameters µ and Σ.
Computation hint: at each iteration of the imputation phase, keep only the parameters you are interested in rather than the whole dataset (this saves memory).

Solution #3: Multiple Imputation
The multiple imputation algorithm: an MVN example.

Y ~ N_2( (0, 0)',  [ 1  0.9 ; 0.9  1 ] )

Y = (Y_1, Y_2): Y_1 always observed, Y_2 missing if Y_1 < −1. [Figure: scatterplot of the simulated bivariate data.]

Solution #3: Multiple Imputation
The imputation step (algorithm) for the MVN example:
1. Set µ̂_0 and Σ̂_0 to the complete-case empirical mean and covariance matrix.
2. For m = 1, ..., M:
   i. For each observation with Y_2 missing, draw the missing value from the conditional distribution

      y_2 | y_1 ~ N( µ_2 + (σ_21/σ_1²)(y_1 − µ_1),  σ_2² − σ_21²/σ_1² ).

   ii. Set

      µ̂_m = (1/n) ∑_{i=1}^n y_i,  Σ̂_m = (1/(n−1)) ∑_{i=1}^n (y_i − µ̂_m)(y_i − µ̂_m)'.
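The imputation step can be sketched directly for the bivariate example (simulated data with ρ = 0.9 and the Y_2-missing-when-Y_1 < −1 rule from the slides; sample size and iteration count are my choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate the bivariate example: rho = 0.9, Y2 missing
# whenever Y1 < -1 (MAR, since Y1 is always observed).
n = 1000
Y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=n)
y1, y2 = Y[:, 0], Y[:, 1].copy()
miss = y1 < -1
y2[miss] = np.nan

# 1. Initialize with complete-case estimates.
cc = np.column_stack([y1[~miss], y2[~miss]])
mu, S = cc.mean(axis=0), np.cov(cc, rowvar=False)

# 2. Iterate: draw missing y2 from its conditional, then re-estimate.
for m in range(50):
    cmean = mu[1] + S[0, 1] / S[0, 0] * (y1[miss] - mu[0])
    cvar = S[1, 1] - S[0, 1] ** 2 / S[0, 0]
    y2[miss] = cmean + rng.normal(scale=np.sqrt(cvar), size=miss.sum())
    filled = np.column_stack([y1, y2])
    mu, S = filled.mean(axis=0), np.cov(filled, rowvar=False)

print(mu)        # should be near (0, 0)
print(S[0, 1])   # should be near 0.9
```

The complete-case starting values are biased (the left tail of Y_1 is gone), but the iterations pull the estimates back toward the truth because the missingness depends only on the observed Y_1.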

Solution #3: Multiple Imputation
The imputation step (algorithm): issues to consider.
1. The sequence of parameter estimates and missing-data imputations should converge.

Solution #3: Multiple Imputation
The imputation step (algorithm): issues to consider.
2. How do we assess convergence?
   - Subjective assessment of trace plots.
   - Formal convergence diagnostics (the coda package in R), which you'll learn about in 651 (Bayes).

Solution #3: Multiple Imputation
The pooling phase (pool results across the M imputed datasets). Pooling parameter estimates (θ is the parameter you are interested in):

θ̄ = (1/M) ∑_{m=1}^M θ̂_m.

Note: this pooled estimate is most appropriate under normality. Use the median if the estimates are skewed.

Solution #3: Multiple Imputation
The pooling phase: pooling the M standard errors.

V_w = (1/M) ∑_{m=1}^M SE²(θ̂_m)        (within-imputation variance)
V_b = (1/(M−1)) ∑_{m=1}^M (θ̂_m − θ̄)²   (between-imputation variance)
V_T = V_w + V_b + V_b/M  ⇒  SE_pool = √V_T.
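These pooling rules (Rubin's rules) are short enough to write out in plain Python; the per-imputation estimates and standard errors below are hypothetical numbers, just to exercise the formulas:

```python
import math

# Hypothetical estimates and SEs from M = 5 imputed datasets.
est = [1.02, 0.95, 1.10, 0.98, 1.05]
se = [0.21, 0.20, 0.22, 0.19, 0.21]
M = len(est)

theta_bar = sum(est) / M
V_w = sum(s ** 2 for s in se) / M                        # within-imputation
V_b = sum((t - theta_bar) ** 2 for t in est) / (M - 1)   # between-imputation
V_T = V_w + V_b + V_b / M                                # total variance
se_pool = math.sqrt(V_T)

print(theta_bar, se_pool)
```

Note that SE_pool is larger than any single-imputation SE whenever the imputations disagree: the V_b terms are exactly the imputation uncertainty that single imputation ignores.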

Solution #3: Multiple Imputation
The pooling phase: fraction of missing information.

FMI = (V_b + V_b/M) / V_T

Hypothesis testing and confidence intervals:

t = (θ̄ − θ_0)/√V_T,  ν = (M − 1)/FMI²  ⇒  CI: θ̄ ± t*_ν √V_T.
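The inference quantities follow directly from the variance components; a sketch with hypothetical values for V_w and V_b (looking up the t quantile would need a table or scipy, so only FMI and the degrees of freedom are computed here):

```python
# Hypothetical pooled variance components from M = 5 imputations.
M = 5
V_w, V_b = 0.04254, 0.00345
V_T = V_w + V_b + V_b / M

fmi = (V_b + V_b / M) / V_T        # fraction of missing information
nu = (M - 1) / fmi ** 2            # approximate t degrees of freedom

# With little between-imputation variance, FMI is small and the
# reference t distribution is close to normal (large nu).
print(round(fmi, 3), round(nu, 1))
```

When FMI is large (imputations disagree a lot), ν shrinks and the confidence intervals widen, which is exactly the behavior you want.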

Points about the MVN Approach to Regression
A few points:
1. The data have to follow an MVN distribution: univariate histograms should look normal, and bivariate relationships should be linear.
2. You can't use it directly with categorical covariates (those aren't normal). But the general idea carries over: define a joint distribution for (Y, X), then fill in the missing data from the conditionals.
3. You can use this even if you don't have any missing observations (it's another way to do regression), but if you are always given the x's, the usual approach is easier because you can use lm().
4. It is a very useful tool if you don't know which variable is your response variable, or if you have multiple response variables (see Stat 666).

Expectations for the Employee Analysis
Expectations:
1. Carry out a regression without discarding the incomplete observations. Justify the techniques you use (e.g., if you use mean imputation, explain why).
2. Justify any assumptions in your model.
3. Don't worry about variable selection (just use all the variables; there aren't that many anyway).
4. Describe how well you explain job performance using the variables you have.
5. Include uncertainty.