Missing Data and Imputation

1 Missing Data and Imputation
Hoff Chapter 7, GH Chapter 25
April 21, 2017

2 Bednets and Malaria
- Y: presence or absence of parasites in a blood smear
- AGE: age of child
- BEDNET: bed net use (exposure)
- GREEN: greenness of the surrounding vegetation, based on satellite photography
- PHC: whether a village is part of a primary health-care system

3 Bednets and Malaria

```r
malaria = read.csv("gambia.dat", header=TRUE)
summary(malaria)
```

```
       Y               AGE            BEDNET            GREEN
 Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Min.   :28.85
 1st Qu.:...      1st Qu.:1.000   1st Qu.:...      1st Qu.:40.85
 Median :0.0000   Median :2.000   Median :1.0000   Median :40.85
 Mean   :0.3093   Mean   :2.399   Mean   :0.7049   Mean   :39.84
 3rd Qu.:...      3rd Qu.:3.000   3rd Qu.:...      3rd Qu.:40.85
 Max.   :1.0000   Max.   :4.000   Max.   :1.0000   Max.   :47.65
                                  NA's   :317
```

(The PHC column is cut off in the source; `...` marks values that could not be recovered.)

BEDNET has 317 NA's: 39% missing.

6 More about missingness
Consider:
- Probability of missingness: are certain groups more likely to have missing data?
- Are certain responses more likely to be missing? (E.g., individuals with high income are more likely not to report it.) In that case the probability of missingness depends on the value of the outcome.
- The analysis depends on assumptions about missingness.

10 Mechanisms for Missingness
- Missing Completely at Random (MCAR): missingness does not depend on the outcome or on other variables.
- Missing at Random (MAR): missingness does not depend on the value of the variable that is missing, but may depend on other (observed) variables.
- Missing Not at Random (MNAR): missingness depends on the value of the variable that is missing.
- We cannot tell from the data alone which mechanism holds.
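
To make the three mechanisms concrete, here is a small simulation sketch in base R. Everything in it is hypothetical and chosen only for illustration (the variables `age` and `income`, the sample size, and the logistic missingness models); none of it comes from the malaria data. It compares the mean of the values that remain observed with the mean of the full data.

```r
set.seed(42)
n      <- 10000
age    <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)   # true mean is 40

# Probability that income is missing under each mechanism
p_mcar <- rep(0.3, n)                  # MCAR: constant for everyone
p_mar  <- plogis((age - 40) / 5)       # MAR: depends only on observed age
p_mnar <- plogis((income - 40) / 5)    # MNAR: depends on income itself

# Mean of income among the cases that remain observed
observed_mean <- function(p_miss) {
  miss <- rbinom(n, 1, p_miss) == 1
  mean(income[!miss])
}

means <- c(full = mean(income),
           MCAR = observed_mean(p_mcar),
           MAR  = observed_mean(p_mar),
           MNAR = observed_mean(p_mnar))
print(round(means, 2))
```

In this setup the observed-case mean stays close to the full-data mean only under MCAR. Under MAR the bias could be removed by conditioning on `age`; under MNAR it cannot be removed without modeling the missingness itself.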

14 Modeling
- Delete subjects with any missing observations.
  - This would remove 39% of the data and reduce power.
  - It induces bias if the data are not missing completely at random!
- Replace each missing value with an estimated mean (plug-in approach).
  - This implies that we are certain about the values of the missing cases, so any measures of uncertainty in parameter estimates are overly optimistic (intervals too narrow).
  - It distorts the correlation structure in the data.
- Work with likelihoods based on the observed data; this will be a product of marginal distributions, which is difficult to work with.
- Model-based methods.
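
The effect of the plug-in approach on spread and correlation can be seen in a short base R sketch. The data are simulated and hypothetical (a predictor `x`, an outcome `y`, and about 39% of `x` removed completely at random); the sketch is an illustration, not an analysis of the malaria data.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Remove about 39% of the x values completely at random
miss  <- runif(n) < 0.39
x_obs <- ifelse(miss, NA, x)

# Option 1: complete-case analysis (drop rows with missing x)
cc     <- !is.na(x_obs)
sd_cc  <- sd(x_obs[cc])
cor_cc <- cor(x_obs[cc], y[cc])

# Option 2: plug in the observed mean for every missing value
x_imp   <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)
sd_imp  <- sd(x_imp)
cor_imp <- cor(x_imp, y)

print(round(c(sd_cc = sd_cc, sd_imp = sd_imp,
              cor_cc = cor_cc, cor_imp = cor_imp), 3))
```

Because every imputed value sits exactly at the mean, the imputed variable has a smaller standard deviation and a weaker correlation with `y` than the complete cases do, which is the distortion the slide warns about.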

20 Observed Data
For each subject $i$ we have a data vector and a vector of observation indicators:

$$\mathbf{Y}_i = (Y_{i,1}, Y_{i,2}, Y_{i,3}, Y_{i,4}, Y_{i,5}), \qquad \mathbf{O}_i = (O_{i,1}, O_{i,2}, O_{i,3}, O_{i,4}, O_{i,5}),$$

where $O_{i,j}$ is 1 if $Y_{i,j}$ is observed and $O_{i,j}$ is 0 if $Y_{i,j}$ is missing.

Missing at Random Data:
- $\mathbf{O}_i$ and $\mathbf{Y}_i$ are independent given $\theta$
- the distribution for $\mathbf{O}_i$ does not depend on $\theta$

Marginal model for the observed data:

$$p(\mathbf{o}_i, \mathbf{y}_i[\mathbf{o}_i = 1] \mid \theta) = p(\mathbf{o}_i)\, p(\mathbf{y}_i[\mathbf{o}_i = 1] \mid \theta) = p(\mathbf{o}_i) \int p(y_{i,1}, y_{i,2}, y_{i,3}, y_{i,4}, y_{i,5} \mid \theta) \prod_{j:\, o_{i,j} = 0} dy_{i,j}$$

Integrate over the missing variables to obtain the likelihood.
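
For a multivariate normal model the integral over the missing entries has a closed form: the observed entries of a row are again multivariate normal, with the matching sub-vector of the mean and sub-matrix of the covariance. The base R sketch below uses that fact to evaluate an observed-data log-likelihood. The helper names `log_dmvn` and `obs_loglik` and the small data matrix are hypothetical, and the factor $p(\mathbf{o}_i)$ is omitted because it does not involve $\theta$.

```r
# Log density of a multivariate normal, written out in base R
log_dmvn <- function(y, mu, Sigma) {
  k <- length(y)
  R <- chol(Sigma)                               # Sigma = t(R) %*% R
  z <- backsolve(R, y - mu, transpose = TRUE)    # solves t(R) z = y - mu
  -0.5 * k * log(2 * pi) - sum(log(diag(R))) - 0.5 * sum(z^2)
}

# Observed-data log-likelihood: each row contributes the marginal
# density of its observed entries only
obs_loglik <- function(Y, mu, Sigma) {
  total <- 0
  for (i in seq_len(nrow(Y))) {
    o <- !is.na(Y[i, ])
    if (any(o)) {
      total <- total + log_dmvn(Y[i, o], mu[o], Sigma[o, o, drop = FALSE])
    }
  }
  total
}

mu    <- c(0, 1)
Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
Y     <- rbind(c( 0.3, 1.2),    # fully observed
               c(  NA, 0.4),    # first entry missing
               c(-1.0,  NA))    # second entry missing
print(obs_loglik(Y, mu, Sigma))
```

With many variables and many missingness patterns, every row can need a different sub-vector and sub-matrix, which is one reason this likelihood is awkward to work with directly.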

24 Use the Gibbs Sampler to Integrate
If we had complete data, we would draw $\theta$ from the conditional distribution of $\theta \mid Y$ (as in class, for sampling $\mu$ and $\Sigma$).

Add a step at each iteration to generate the missing data:
1. Generate $Y_{miss}^{(t+1)}$ from $p(y_{miss} \mid Y_{obs}, \theta^{(t)})$ and fill in the missing data to obtain a complete matrix $Y$ from $Y_{obs}$ and $Y_{miss}^{(t+1)}$.
2. Generate $\theta^{(t+1)}$ from $p(\theta \mid Y_{obs}, Y_{miss}^{(t+1)})$.

Averaging over the draws of $Y_{miss}$ marginalizes (integrates) over the missing dimensions.
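
The two steps can be sketched in base R for a deliberately simplified case. The assumptions here are mine, not the slides': bivariate normal data, a known covariance matrix `Sigma`, a flat prior on the mean `mu` (so only `mu` is sampled, not $\Sigma$), and missingness in the second column that depends on the always-observed first column. All data are simulated.

```r
set.seed(7)
n       <- 500
Sigma   <- matrix(c(1, 0.8, 0.8, 1), 2, 2)   # treated as known in this sketch
mu_true <- c(0, 2)
L       <- t(chol(Sigma))                    # lower-triangular factor of Sigma
Y_full  <- t(mu_true + L %*% matrix(rnorm(2 * n), 2, n))

# MAR: column 2 is more likely to be missing when column 1 is large
miss <- runif(n) < plogis(Y_full[, 1])
Y <- Y_full
Y[miss, 2] <- NA
cc_mean <- mean(Y[!miss, 2])                 # complete-case mean of column 2

n_iter <- 2000
mu     <- colMeans(Y, na.rm = TRUE)          # starting value
draws  <- matrix(NA_real_, n_iter, 2)

# Conditional distribution of y2 given y1 for a bivariate normal
slope   <- Sigma[2, 1] / Sigma[1, 1]
cond_sd <- sqrt(Sigma[2, 2] - Sigma[2, 1]^2 / Sigma[1, 1])

for (iter in seq_len(n_iter)) {
  # Step 1: draw Y_miss from p(y_miss | Y_obs, theta) and fill it in
  cond_mean  <- mu[2] + slope * (Y[miss, 1] - mu[1])
  Y[miss, 2] <- rnorm(sum(miss), cond_mean, cond_sd)

  # Step 2: draw theta from p(theta | Y_obs, Y_miss); with a flat prior
  # on mu and known Sigma this is N(ybar, Sigma / n)
  mu <- colMeans(Y) + drop(L %*% rnorm(2)) / sqrt(n)
  draws[iter, ] <- mu
}

post_mean <- colMeans(draws[-(1:500), ])     # discard the first 500 draws
print(round(c(cc_mean = cc_mean, post_mu2 = post_mean[2]), 2))
```

In this simulation the complete-case mean of the second column is pulled downward, because large values of the first column are more often missing, while the posterior mean from the sampler stays near the value used to generate the data.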

25 JAGS Model

```r
model = function() {
  # outcome model: logistic regression for presence of parasites
  for (i in 1:N) {
    Y[i] ~ dbern(p[i])
    logit(p[i]) <- alpha + betaage*AGE[i] + betabednet*BEDNET[i] +
                   betagreen*GREEN[i] + betaphc*PHC[i]
  }
  # model for missing exposure variable
  for (i in 1:N) {
    BEDNET[i] ~ dbern(q)   # prior model for whether or not child
                           # sleeps under treated bednet
  }
  # uniform prior on prob of sleeping under bednet
  q ~ dbeta(1, 1)
  # vague priors on regression coefficients
  # (the precision values on the slide were lost; tau0 is a placeholder
  #  for a small prior precision and must be supplied with the data)
  alpha ~ dnorm(0, tau0)
  betaage ~ dnorm(0, tau0)
  betabednet ~ dnorm(0, tau0)
  betagreen ~ dnorm(0, tau0)
  betaphc ~ dnorm(0, tau0)
  # calculate odds ratios of interest
  ORbednet <- exp(betabednet)   # OR of malaria for children using bednet
}
```

26 Posterior Density

```r
theta = as.data.frame(sim$BUGSoutput$sims.matrix)
plot(density(theta[,1]), xlab="OR Bednet", main="")
```

[Figure: posterior density of the bednet odds ratio; x-axis "OR Bednet", y-axis "Density"]

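27 JAGS Model

```r
model2 = function() {
  for (i in 1:N) {
    Y[i] ~ dbern(p[i])
    logit(p[i]) <- alpha + betaage*AGE[i] + betabednet*BEDNET[i] +
                   betagreen*GREEN[i] + betaphc*PHC[i]
  }
  # model for missing exposure variable
  for (i in 1:N) {
    BEDNET[i] ~ dbern(q[i])                     # prior model for bednet use
    logit(q[i]) <- gamma[1] + gamma[2]*PHC[i]   # allow prob to depend on PHC
  }
  # vague priors on regression coefficients
  # (the precision values on the slide were lost; tau0 is a placeholder
  #  for a small prior precision and must be supplied with the data)
  gamma[1] ~ dnorm(0, tau0)
  gamma[2] ~ dnorm(0, tau0)
  alpha ~ dnorm(0, tau0)
  betaage ~ dnorm(0, tau0)
  betabednet ~ dnorm(0, tau0)
  betagreen ~ dnorm(0, tau0)
  betaphc ~ dnorm(0, tau0)
  # calculate odds ratios of interest
  ORbednet <- exp(betabednet)     # OR of malaria for children using bednet
  # OR of bednet use for PHC villages; this line is cut off on the slide
  # and is inferred from the ORbednetPHC column used on later slides
  ORbednetPHC <- exp(gamma[2])
}
```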

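28 Posterior Density

```r
thetaphc = as.data.frame(simphc$BUGSoutput$sims.matrix)
plot(density(thetaphc[,1]), xlab="OR Malaria Bednet", main="")
```

[Figure: posterior density of the malaria odds ratio for bednet use; x-axis "OR Malaria Bednet", y-axis "Density"]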

29 Posterior Density

```r
plot(density(thetaphc[,"ORbednetPHC"]), xlab="OR BEDNET PHC", main="")
```

[Figure: posterior density of the odds ratio of bednet use for PHC; x-axis "OR BEDNET PHC", y-axis "Density"]

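30 Intervals

```r
# complete-case analysis: glm() drops rows with missing values by default
exp(confint(glm(Y ~ ., data=malaria, family=binomial), parm="BEDNET"))

# 95% HPD intervals from the two imputation models
# (HPDinterval and as.mcmc are from the coda package)
HPDinterval(as.mcmc(theta))
HPDinterval(as.mcmc(thetaphc))
```

Layout of the output (the numeric values are not recoverable from the source):

```
glm:       2.5 %   97.5 %
             ...      ...

theta:            lower  upper
ORbednet            ...    ...
betabednet          ...    ...
deviance            ...    ...
attr(,"probability")
[1] 0.95

thetaphc:         lower  upper
ORbednet            ...    ...
ORbednetPHC         ...    ...
deviance            ...    ...
attr(,"probability")
```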

35 More than one variable with missing data
- Model each predictor (joint distribution)
- Coherent sequential model of conditional distributions
- Handle a mix of discrete and continuous variables
- Categorical variables: continuation ratios are easiest
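
One way to read "coherent sequential model" is as a factorization of the joint distribution of the predictors into a chain of conditionals. As an illustration with three hypothetical predictors $x_1, x_2, x_3$ and parameter blocks $\phi_1, \phi_2, \phi_3$ (notation introduced here, not taken from the slides):

$$p(x_1, x_2, x_3 \mid \phi) = p(x_1 \mid \phi_1)\, p(x_2 \mid x_1, \phi_2)\, p(x_3 \mid x_1, x_2, \phi_3)$$

Each factor can be a regression suited to the type of that variable, for example a logistic regression for a binary predictor and a normal linear regression for a continuous one. For a categorical predictor $x$ with levels $1, \dots, K$, a continuation-ratio model specifies a sequence of binary models with linear predictors $\eta_k$:

$$\Pr(x = k \mid x \ge k) = \operatorname{logit}^{-1}(\eta_k), \quad k = 1, \dots, K-1, \qquad \text{so that} \qquad \Pr(x = k) = \operatorname{logit}^{-1}(\eta_k) \prod_{l < k} \bigl(1 - \operatorname{logit}^{-1}(\eta_l)\bigr).$$

Because each step is just a Bernoulli model, this form can be written with the same `dbern` and `logit` building blocks used for BEDNET.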

39 Missing Not at Random
- The probability of missingness depends on the value of the predictor that is missing
- Need to jointly model the missingness indicators and the outcomes
- Model missingness given the variables
- Need more information!
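
A common way to write such a joint model is the selection-model factorization. In the notation of the Observed Data slide, and with a missingness parameter $\psi$ introduced here purely for illustration:

$$p(\mathbf{y}_i, \mathbf{o}_i \mid \theta, \psi) = p(\mathbf{y}_i \mid \theta)\, p(\mathbf{o}_i \mid \mathbf{y}_i, \psi)$$

Under MNAR the second factor depends on the missing entries of $\mathbf{y}_i$, so it can no longer be pulled out of the integral over the missing values:

$$p(\mathbf{o}_i, \mathbf{y}_i[\mathbf{o}_i = 1] \mid \theta, \psi) = \int p(\mathbf{y}_i \mid \theta)\, p(\mathbf{o}_i \mid \mathbf{y}_i, \psi) \prod_{j:\, o_{i,j} = 0} dy_{i,j}$$

As a sketch, one could take $\operatorname{logit} \Pr(O_{i,j} = 1 \mid y_{i,j}) = \psi_0 + \psi_1 y_{i,j}$. The observed data carry little or no direct information about $\psi_1$, since $y_{i,j}$ is never seen when $O_{i,j} = 0$, which is why outside information or strong assumptions are needed.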

43 Summary
- Make sure you know how missing data are coded!
- Think about why they are missing; e.g., if there is no garage then there can be no garage condition.
- Joint models require understanding more about the data and the reasons for missingness, and more sophisticated modelling.
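
The first two points can be illustrated with a short base R sketch. The tiny data set is hypothetical, including the `-999` sentinel code, the column names, and the `"NoGarage"` level; the `na.strings` argument of `read.csv` is the part that carries over to real files.

```r
raw <- "id,age,garage,garage_cond
1,34,Yes,Good
2,-999,No,NA
3,51,Yes,
4,27,Yes,Fair"

# Naive read: the sentinel -999 is treated as a real age
naive <- read.csv(text = raw)
naive_mean_age <- mean(naive$age)

# Declare every code that means "missing" when reading the file
clean <- read.csv(text = raw, na.strings = c("NA", "-999", ""))
clean_mean_age <- mean(clean$age, na.rm = TRUE)

# A house without a garage has no garage condition: that is
# "not applicable", not an unknown value, so give it its own level
clean$garage_cond[clean$garage == "No"] <- "NoGarage"

print(c(naive = naive_mean_age, clean = clean_mean_age))
print(clean)
```

After the recode, only the row whose garage condition is genuinely unknown is still `NA`, so an imputation model would not try to fill in a condition for a garage that does not exist.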
