Missing Data and Imputation

1 Missing Data and Imputation
Hoff Chapter 7, GH Chapter 25
April 21, 2017

2 Bednets and Malaria
Y: presence or absence of parasites in a blood smear
AGE: age of child
BEDNET: bed net use (exposure)
GREEN: greenness of the surrounding vegetation based on satellite photography
PHC: whether a village is part of a primary health-care system

3 Bednets and Malaria
    malaria = read.csv("gambia.dat", header=TRUE)
    summary(malaria)

               Y        AGE      BEDNET    GREEN
    Min.       0.0000   1.000    0.0000    28.85
    1st Qu.             1.000              40.85
    Median     0.0000   2.000    1.0000    40.85
    Mean       0.3093   2.399    0.7049    39.84
    3rd Qu.             3.000              40.85
    Max.       1.0000   4.000    1.0000    47.65
    NA's                         317

BEDNET has 317 NA's: 39% missing.

6 More about missingness
Consider:
Probability of missingness: are certain groups more likely to have missing data?
Are certain responses more likely to be missing? (E.g., individuals with high income are more likely not to report it.) Then the probability of missingness depends on the value of the outcome.
The analysis depends on the assumptions made about missingness.

10 Mechanisms for Missingness
Missing Completely at Random (MCAR): missingness does not depend on the outcome or on other variables.
Missing at Random (MAR): missingness does not depend on the value of the variable itself, but may depend on other variables.
Missing Not at Random (MNAR): missingness depends on the value of the variable that is missing.
The data alone cannot tell us whether missingness depends on the missing values.
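To make the three mechanisms concrete, here is a small simulation sketch. It is written in Python (standard library only) rather than the R/JAGS used in the slides, and everything in it is invented for illustration: the income-like variable, the binary group indicator, and all missingness probabilities. It deletes values under each mechanism and compares the mean of what remains with the full-data mean.

```python
import random
import statistics

random.seed(1)

n = 20000
# Hypothetical data: a fully observed group indicator and an income-like
# variable whose mean depends on the group.
group = [random.random() < 0.5 for _ in range(n)]
income = [random.gauss(60.0 if g else 40.0, 10.0) for g in group]

def keep_mask(p_missing):
    """Return True for each case that stays observed."""
    return [random.random() >= p for p in p_missing]

def mean_kept(values, mask):
    """Mean of the values whose mask entry is True."""
    return statistics.fmean([v for v, k in zip(values, mask) if k])

# MCAR: every value has the same chance of being missing.
mcar = keep_mask([0.4] * n)
# MAR: missingness depends only on the observed group indicator.
mar = keep_mask([0.7 if g else 0.1 for g in group])
# MNAR: missingness depends on the value that goes missing.
mnar = keep_mask([0.7 if y > 50.0 else 0.1 for y in income])

full_mean = statistics.fmean(income)
mcar_mean = mean_kept(income, mcar)
mar_mean = mean_kept(income, mar)
mnar_mean = mean_kept(income, mnar)

# Under MAR the overall observed mean is off, but within a group it is not.
high_full = statistics.fmean([y for y, g in zip(income, group) if g])
high_mar = statistics.fmean(
    [y for y, g, k in zip(income, group, mar) if g and k]
)

print(f"full {full_mean:.1f}  MCAR {mcar_mean:.1f}  "
      f"MAR {mar_mean:.1f}  MNAR {mnar_mean:.1f}")
print(f"high group: full {high_full:.1f}, observed under MAR {high_mar:.1f}")
```

In this toy setup the MAR bias disappears once we condition on the observed group, while under MNAR no observed variable is available to correct for it, which is why that case needs extra information.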

14 Modeling
Delete subjects with any missing observations.
    This would remove 39% of the data and reduce power.
    Induces bias if data are not missing completely at random!
Replace each missing value with an estimated mean (plug-in approach).
    This implies that we are certain about the values of the missing cases, so any measures of uncertainty in parameter estimates are overly optimistic (too narrow).
    Distorts the correlation structure in the data.
Work with likelihoods based on the observed data; this will be a product of marginal distributions, which is difficult to work with.
Model-based methods.
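The cost of the plug-in approach can be seen in a short sketch. The following Python (standard library only, not the R used in the slides) simulates a hypothetical correlated pair, deletes 40% of one variable completely at random, fills the gaps with the observed mean, and compares the spread and the correlation before and after. The variable names, coefficients, and missingness rate are assumptions made for the illustration.

```python
import random
import statistics

random.seed(2)

n = 5000
# Hypothetical correlated pair: x is always observed, y is sometimes missing.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.8 * xi + random.gauss(0.0, 0.6) for xi in x]

# Delete 40% of y completely at random (None marks a missing value).
y_obs = [yi if random.random() >= 0.4 else None for yi in y]

# Plug-in approach: replace each missing y with the mean of the observed y.
y_bar = statistics.fmean([v for v in y_obs if v is not None])
y_filled = [y_bar if v is None else v for v in y_obs]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

sd_true = statistics.stdev(y)
sd_filled = statistics.stdev(y_filled)
corr_true = corr(x, y)
corr_filled = corr(x, y_filled)

print(f"sd of y:    full {sd_true:.2f}, mean-imputed {sd_filled:.2f}")
print(f"corr(x, y): full {corr_true:.2f}, mean-imputed {corr_filled:.2f}")
```

Every imputed case sits exactly at the mean, so it adds nothing to the variance of y or to its covariance with x. Both shrink, and standard errors computed from the filled-in data come out too small.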

20 Observed Data
\[
(Y_{i,1}, Y_{i,2}, Y_{i,3}, Y_{i,4}, Y_{i,5}), \qquad (O_{i,1}, O_{i,2}, O_{i,3}, O_{i,4}, O_{i,5})
\]
where $O_{i,j}$ is 1 if $Y_{i,j}$ is observed and $O_{i,j}$ is 0 if $Y_{i,j}$ is missing.
Missing at Random Data: $O_i$ and $Y_i$ are independent given $\theta$, and the distribution for $O_i$ does not depend on $\theta$.
Marginal model for observed data:
\[
p(o_i, \{y_{i,j} : o_{i,j} = 1\} \mid \theta)
  = p(o_i)\, p(\{y_{i,j} : o_{i,j} = 1\} \mid \theta)
  = p(o_i) \int p(y_{i,1}, y_{i,2}, y_{i,3}, y_{i,4}, y_{i,5} \mid \theta) \prod_{j \,:\, o_{i,j} = 0} dy_{i,j}
\]
Integrate over the missing variables to obtain the likelihood.
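As a worked instance of this integral, assume a multivariate normal sampling model, $Y_i \mid \theta \sim \mathrm{N}_5(\mu, \Sigma)$ with $\theta = (\mu, \Sigma)$. This is an assumption for the example; the slide leaves the model generic. Suppose components 2 and 4 are missing, so $o_i = (1, 0, 1, 0, 1)$. Integrating a multivariate normal density over some coordinates leaves a normal density for the rest, with the corresponding sub-vector of $\mu$ and sub-matrix of $\Sigma$:

```latex
\[
p(o_i, y_{i,1}, y_{i,3}, y_{i,5} \mid \theta)
  = p(o_i) \int\!\!\int p(y_{i,1}, y_{i,2}, y_{i,3}, y_{i,4}, y_{i,5} \mid \theta)\, dy_{i,2}\, dy_{i,4}
  = p(o_i)\, \mathrm{N}_3\!\bigl( (y_{i,1}, y_{i,3}, y_{i,5}) \,\bigm|\, \mu_{[1,3,5]},\ \Sigma_{[1,3,5],[1,3,5]} \bigr)
\]
```

Each subject can have a different pattern $o_i$, so the likelihood is a product of marginal densities of different dimensions. That is what makes it awkward to handle directly.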

24 Use the Gibbs Sampler to Integrate
If we had complete data, we would draw $\theta$ from the conditional distribution of $\theta \mid Y$, as in class for sampling $\mu$ and $\Sigma$.
Add a step at each iteration to generate the missing data:
1. Generate $Y_{\text{miss}}^{(t+1)}$ from $p(Y_{\text{miss}} \mid Y_{\text{obs}}, \theta^{(t)})$ and fill in the missing data to obtain a complete matrix $Y$ from $Y_{\text{obs}}$ and $Y_{\text{miss}}^{(t+1)}$.
2. Generate $\theta^{(t+1)}$ from $p(\theta \mid Y_{\text{obs}}, Y_{\text{miss}}^{(t+1)})$.
Averaging over the draws of $Y_{\text{miss}}$ marginalizes over the missing dimensions.
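The two-step scheme can be sketched on a toy problem. The Python below (standard library only, not the R/JAGS of the slides) assumes a deliberately simple model that is not the one fitted in the lecture: a binary exposure x with P(x = 1) = q, a binary outcome y with P(y = 1 | x) = p[x], independent Beta(1, 1) priors on q, p[0] and p[1], and exposure values deleted completely at random. All names and true values are hypothetical.

```python
import random

random.seed(3)

# --- Simulate hypothetical data (all true values below are invented) ---
n, q_true, p_true = 600, 0.7, (0.4, 0.2)   # p_true[x] = P(y = 1 | x)
x_full = [1 if random.random() < q_true else 0 for _ in range(n)]
y = [1 if random.random() < p_true[xi] else 0 for xi in x_full]
# About 40% of the exposure values are deleted (None = missing).
x_obs = [xi if random.random() >= 0.4 else None for xi in x_full]
missing = [i for i, xi in enumerate(x_obs) if xi is None]

def bern_lik(prob, yi):
    """Bernoulli likelihood prob^y (1 - prob)^(1 - y)."""
    return prob if yi == 1 else 1.0 - prob

# --- Gibbs sampler with a data-augmentation step ---
q, p = 0.5, [0.5, 0.5]          # starting values for theta = (q, p[0], p[1])
x = [xi if xi is not None else 0 for xi in x_obs]
q_draws, or_draws = [], []
n_iter, burn_in = 800, 200

for t in range(n_iter):
    # Step 1: draw each missing x_i from p(x_i | y_i, theta^(t)).
    for i in missing:
        w1 = q * bern_lik(p[1], y[i])
        w0 = (1.0 - q) * bern_lik(p[0], y[i])
        x[i] = 1 if random.random() < w1 / (w1 + w0) else 0

    # Step 2: draw theta^(t+1) given the completed data. With Beta(1, 1)
    # priors every full conditional is a Beta distribution.
    n1 = sum(x)
    q = random.betavariate(1 + n1, 1 + n - n1)
    for b in (0, 1):
        events = sum(yi for xi, yi in zip(x, y) if xi == b)
        total = n1 if b == 1 else n - n1
        p[b] = random.betavariate(1 + events, 1 + total - events)

    if t >= burn_in:
        q_draws.append(q)
        or_draws.append((p[1] / (1 - p[1])) / (p[0] / (1 - p[0])))

q_post_mean = sum(q_draws) / len(q_draws)
or_post_mean = sum(or_draws) / len(or_draws)
print(f"posterior mean of q: {q_post_mean:.3f} (true value {q_true})")
print(f"posterior mean of the odds ratio: {or_post_mean:.3f}")
```

The retained draws of q and of the odds ratio already average over the imputed exposures, so no separate integration step is needed.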

25 JAGS Model
    model = function() {
      for (i in 1:N) {
        Y[i] ~ dbern(p[i])
        logit(p[i]) <- alpha + betaage*AGE[i] + betabednet*BEDNET[i] +
                       betagreen*GREEN[i] + betaphc*PHC[i]
      }
      # model for missing exposure variable
      for (i in 1:N) {
        BEDNET[i] ~ dbern(q)   # prior model for whether or not child
                               # sleeps under treated bednet
      }
      # uniform prior on probability of sleeping under bednet
      q ~ dbeta(1, 1)
      # vague priors on regression coefficients
      # (the precision arguments are missing from this transcript)
      alpha ~ dnorm(0, )
      betaage ~ dnorm(0, )
      betabednet ~ dnorm(0, )
      betagreen ~ dnorm(0, )
      betaphc ~ dnorm(0, )
      # calculate odds ratios of interest
      ORbednet <- exp(betabednet)   # OR of malaria for children using bednet
    }

26 Posterior Density
    theta = as.data.frame(sim$BUGSoutput$sims.matrix)
    plot(density(theta[,1]), xlab="OR Bednet", main="")
[Figure: posterior density of OR Bednet]

27 JAGS Model
    model2 = function() {
      for (i in 1:N) {
        Y[i] ~ dbern(p[i])
        logit(p[i]) <- alpha + betaage*AGE[i] + betabednet*BEDNET[i] +
                       betagreen*GREEN[i] + betaphc*PHC[i]
      }
      # model for missing exposure variable
      for (i in 1:N) {
        BEDNET[i] ~ dbern(q[i])                       # prior model for bednet use
        logit(q[i]) <- gamma[1] + gamma[2]*PHC[i]     # allow prob to depend on PHC
      }
      # vague priors on regression coefficients
      # (the precision arguments are missing from this transcript)
      gamma[1] ~ dnorm(0, )
      gamma[2] ~ dnorm(0, )
      alpha ~ dnorm(0, )
      betaage ~ dnorm(0, )
      betabednet ~ dnorm(0, )
      betagreen ~ dnorm(0, )
      betaphc ~ dnorm(0, )
      # calculate odds ratios of interest
      ORbednet <- exp(betabednet)   # OR of malaria for children using bednet
    }

28 Posterior Density
    thetaphc = as.data.frame(simphc$BUGSoutput$sims.matrix)
    plot(density(thetaphc[,1]), xlab="OR Malaria Bednet", main="")
[Figure: posterior density of OR Malaria Bednet]

29 Posterior Density
    plot(density(thetaphc[,"ORbednetPHC"]), xlab="OR BEDNET PHC", main="")
[Figure: posterior density of OR BEDNET PHC]

30 Intervals
    exp(confint(glm(Y ~ ., data=malaria, family=binomial), parm="BEDNET"))
    #  2.5 %  97.5 %

    HPDinterval(as.mcmc(theta))
    #              lower  upper
    # ORbednet
    # betabednet
    # deviance
    # attr(,"probability")
    # [1] 0.95

    HPDinterval(as.mcmc(thetaphc))
    #              lower  upper
    # ORbednet
    # ORbednetPHC
    # deviance
    # attr(,"probability")

35 More than one variable with missing data
Model each predictor (joint distribution).
Coherent sequential model of conditional distributions.
Handle a mix of discrete and continuous variables.
Categorical: continuation ratios are easiest.
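One way to read "coherent sequential model" is as a chain-rule factorization of the joint distribution of the predictors, with each factor chosen to suit the type of its variable (for example, logistic for binary, normal for continuous). The notation below is an assumption for illustration: $x_1, \ldots, x_p$ are the predictors, $\phi_k$ the parameters of the $k$-th conditional, $X$ a categorical variable with levels $1, \ldots, K$, and $\eta_k(z)$ a linear predictor in the earlier variables $z$. The second line is the continuation-ratio form for such a variable.

```latex
\[
p(x_1, x_2, \ldots, x_p \mid \phi)
  = p(x_1 \mid \phi_1)\, p(x_2 \mid x_1, \phi_2) \cdots p(x_p \mid x_1, \ldots, x_{p-1}, \phi_p)
\]
\[
\operatorname{logit} \Pr(X = k \mid X \ge k,\ z) = \eta_k(z), \qquad k = 1, \ldots, K - 1
\]
```

The product of the conditionals is always a valid joint distribution, so the imputation model is coherent by construction. The continuation-ratio form turns a $K$-level categorical variable into a sequence of $K - 1$ binary regressions.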

39 Missing Not at Random
The probability of missingness depends on the value of the predictor that is missing.
Need to model the missingness indicators and the outcomes jointly.
Model missingness given the variables.
Need more information!
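In the notation of the Observed Data slide, "model missingness given the variables" corresponds to factoring the joint distribution as below. The missingness parameter $\psi$ is a symbol introduced here for illustration.

```latex
\[
p(y_i, o_i \mid \theta, \psi) = p(y_i \mid \theta)\, p(o_i \mid y_i, \psi)
\]
\[
p(o_i, y_{i,\text{obs}} \mid \theta, \psi)
  = \int p(y_{i,\text{obs}}, y_{i,\text{miss}} \mid \theta)\,
         p(o_i \mid y_{i,\text{obs}}, y_{i,\text{miss}}, \psi)\, dy_{i,\text{miss}}
\]
```

Because $p(o_i \mid y_i, \psi)$ now involves $y_{i,\text{miss}}$, it can no longer be pulled outside the integral as $p(o_i)$ was before. The observed data carry little information about how missingness depends on unseen values, which is why extra information is needed.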

43 Summary
Make sure you know how missing data are coded!
Think about why they are missing; e.g., if there is no garage then there can be no garage condition.
Joint models require understanding more about the data and the reasons for missingness, and more sophisticated modelling.


More information

CHAPTER 3. BUILDING A USEFUL EXPONENTIAL RANDOM GRAPH MODEL

CHAPTER 3. BUILDING A USEFUL EXPONENTIAL RANDOM GRAPH MODEL CHAPTER 3. BUILDING A USEFUL EXPONENTIAL RANDOM GRAPH MODEL Essentially, all models are wrong, but some are useful. Box and Draper (1979, p. 424), as cited in Box and Draper (2007) For decades, network

More information

PSS718 - Data Mining

PSS718 - Data Mining Lecture 5 - Hacettepe University October 23, 2016 Data Issues Improving the performance of a model To improve the performance of a model, we mostly improve the data Source additional data Clean up the

More information

[spa-temp.inf] Spatial-temporal information

[spa-temp.inf] Spatial-temporal information [spa-temp.inf] Spatial-temporal information VI Table of Contents for Spatial-temporal information I. Spatial-temporal information........................................... VI - 1 A. Cohort-survival method.........................................

More information

Temporal Modeling and Missing Data Estimation for MODIS Vegetation data

Temporal Modeling and Missing Data Estimation for MODIS Vegetation data Temporal Modeling and Missing Data Estimation for MODIS Vegetation data Rie Honda 1 Introduction The Moderate Resolution Imaging Spectroradiometer (MODIS) is the primary instrument on board NASA s Earth

More information

HANDLING MISSING DATA

HANDLING MISSING DATA GSO international workshop Mathematic, biostatistics and epidemiology of cancer Modeling and simulation of clinical trials Gregory GUERNEC 1, Valerie GARES 1,2 1 UMR1027 INSERM UNIVERSITY OF TOULOUSE III

More information

Regression III: Lab 4

Regression III: Lab 4 Regression III: Lab 4 This lab will work through some model/variable selection problems, finite mixture models and missing data issues. You shouldn t feel obligated to work through this linearly, I would

More information

Warped Mixture Models

Warped Mixture Models Warped Mixture Models Tomoharu Iwata, David Duvenaud, Zoubin Ghahramani Cambridge University Computational and Biological Learning Lab March 11, 2013 OUTLINE Motivation Gaussian Process Latent Variable

More information

Machine Learning in Telecommunications

Machine Learning in Telecommunications Machine Learning in Telecommunications Paulos Charonyktakis & Maria Plakia Department of Computer Science, University of Crete Institute of Computer Science, FORTH Roadmap Motivation Supervised Learning

More information

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy Simulation of Imputation Effects Under Different Assumptions Danny Rithy ABSTRACT Missing data is something that we cannot always prevent. Data can be missing due to subjects' refusing to answer a sensitive

More information

Package EMLRT. August 7, 2014

Package EMLRT. August 7, 2014 Package EMLRT August 7, 2014 Type Package Title Association Studies with Imputed SNPs Using Expectation-Maximization-Likelihood-Ratio Test LazyData yes Version 1.0 Date 2014-08-01 Author Maintainer

More information

Expected Value of Partial Perfect Information in Hybrid Models Using Dynamic Discretization

Expected Value of Partial Perfect Information in Hybrid Models Using Dynamic Discretization Received September 13, 2017, accepted January 15, 2018, date of publication January 31, 2018, date of current version March 12, 2018. Digital Object Identifier 10.1109/ACCESS.2018.2799527 Expected Value

More information

Machine Learning: An Applied Econometric Approach Online Appendix

Machine Learning: An Applied Econometric Approach Online Appendix Machine Learning: An Applied Econometric Approach Online Appendix Sendhil Mullainathan mullain@fas.harvard.edu Jann Spiess jspiess@fas.harvard.edu April 2017 A How We Predict In this section, we detail

More information

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999. 2 Schafer, J. L., Graham, J. W.: (2002). Missing Data: Our View of the State of the Art. Psychological methods, 2002, Vol 7, No 2, 47 77 Rosner, B. (2005) Fundamentals of Biostatistics, 6th ed, Wiley.

More information

Predictive Checking. Readings GH Chapter 6-8. February 8, 2017

Predictive Checking. Readings GH Chapter 6-8. February 8, 2017 Predictive Checking Readings GH Chapter 6-8 February 8, 2017 Model Choice and Model Checking 2 Questions: 1. Is my Model good enough? (no alternative models in mind) 2. Which Model is best? (comparison

More information

Multiple Imputation for Multilevel Models with Missing Data Using Stat-JR

Multiple Imputation for Multilevel Models with Missing Data Using Stat-JR Multiple Imputation for Multilevel Models with Missing Data Using Stat-JR Introduction In this document we introduce a Stat-JR super-template for 2-level data that allows for missing values in explanatory

More information

Will Monroe July 21, with materials by Mehran Sahami and Chris Piech. Joint Distributions

Will Monroe July 21, with materials by Mehran Sahami and Chris Piech. Joint Distributions Will Monroe July 1, 017 with materials by Mehran Sahami and Chris Piech Joint Distributions Review: Normal random variable An normal (= Gaussian) random variable is a good approximation to many other distributions.

More information

Organizing data in R. Fitting Mixed-Effects Models Using the lme4 Package in R. R packages. Accessing documentation. The Dyestuff data set

Organizing data in R. Fitting Mixed-Effects Models Using the lme4 Package in R. R packages. Accessing documentation. The Dyestuff data set Fitting Mixed-Effects Models Using the lme4 Package in R Deepayan Sarkar Fred Hutchinson Cancer Research Center 18 September 2008 Organizing data in R Standard rectangular data sets (columns are variables,

More information

ISyE8843A, Brani Vidakovic Handout 14

ISyE8843A, Brani Vidakovic Handout 14 ISyE8843A, Brani Vidakovic Handout 4 BUGS BUGS is freely available software for constructing Bayesian statistical models and evaluating them using MCMC methodology. BUGS and WINBUGS are distributed freely

More information

Graphical Models, Bayesian Method, Sampling, and Variational Inference

Graphical Models, Bayesian Method, Sampling, and Variational Inference Graphical Models, Bayesian Method, Sampling, and Variational Inference With Application in Function MRI Analysis and Other Imaging Problems Wei Liu Scientific Computing and Imaging Institute University

More information

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. Lecture 12 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University August 23, 2007 1 2 3 4 5 1 2 Introduce the bootstrap 3 the bootstrap algorithm 4 Example

More information

Problem Set 4. Assigned: March 23, 2006 Due: April 17, (6.882) Belief Propagation for Segmentation

Problem Set 4. Assigned: March 23, 2006 Due: April 17, (6.882) Belief Propagation for Segmentation 6.098/6.882 Computational Photography 1 Problem Set 4 Assigned: March 23, 2006 Due: April 17, 2006 Problem 1 (6.882) Belief Propagation for Segmentation In this problem you will set-up a Markov Random

More information

Outline. Bayesian Data Analysis Hierarchical models. Rat tumor data. Errandum: exercise GCSR 3.11

Outline. Bayesian Data Analysis Hierarchical models. Rat tumor data. Errandum: exercise GCSR 3.11 Outline Bayesian Data Analysis Hierarchical models Helle Sørensen May 15, 2009 Today: More about the rat tumor data: model, derivation of posteriors, the actual computations in R. : a hierarchical normal

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

A Nonparametric Bayesian Approach to Detecting Spatial Activation Patterns in fmri Data

A Nonparametric Bayesian Approach to Detecting Spatial Activation Patterns in fmri Data A Nonparametric Bayesian Approach to Detecting Spatial Activation Patterns in fmri Data Seyoung Kim, Padhraic Smyth, and Hal Stern Bren School of Information and Computer Sciences University of California,

More information

optimization_machine_probit_bush106.c

optimization_machine_probit_bush106.c optimization_machine_probit_bush106.c. probit ybush black00 south hispanic00 income owner00 dwnom1n dwnom2n Iteration 0: log likelihood = -299.27289 Iteration 1: log likelihood = -154.89847 Iteration 2:

More information

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation For Intelligent GPS Navigation and Traffic Interpretation Tianshi Gao Stanford University tianshig@stanford.edu 1. Introduction Imagine that you are driving on the highway at 70 mph and trying to figure

More information

An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics

An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Practical 1: Getting started in OpenBUGS Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Practical 1 Getting

More information

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM Jayawant Mandrekar, Daniel J. Sargent, Paul J. Novotny, Jeff A. Sloan Mayo Clinic, Rochester, MN 55905 ABSTRACT A general

More information

Missing Data: What Are You Missing?

Missing Data: What Are You Missing? Missing Data: What Are You Missing? Craig D. Newgard, MD, MPH Jason S. Haukoos, MD, MS Roger J. Lewis, MD, PhD Society for Academic Emergency Medicine Annual Meeting San Francisco, CA May 006 INTRODUCTION

More information

R Programming Basics - Useful Builtin Functions for Statistics

R Programming Basics - Useful Builtin Functions for Statistics R Programming Basics - Useful Builtin Functions for Statistics Vectorized Arithmetic - most arthimetic operations in R work on vectors. Here are a few commonly used summary statistics. testvect = c(1,3,5,2,9,10,7,8,6)

More information

- 1 - Fig. A5.1 Missing value analysis dialog box

- 1 - Fig. A5.1 Missing value analysis dialog box WEB APPENDIX Sarstedt, M. & Mooi, E. (2019). A concise guide to market research. The process, data, and methods using SPSS (3 rd ed.). Heidelberg: Springer. Missing Value Analysis and Multiple Imputation

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

NONPARAMETRIC REGRESSION SPLINES FOR GENERALIZED LINEAR MODELS IN THE PRESENCE OF MEASUREMENT ERROR

NONPARAMETRIC REGRESSION SPLINES FOR GENERALIZED LINEAR MODELS IN THE PRESENCE OF MEASUREMENT ERROR NONPARAMETRIC REGRESSION SPLINES FOR GENERALIZED LINEAR MODELS IN THE PRESENCE OF MEASUREMENT ERROR J. D. Maca July 1, 1997 Abstract The purpose of this manual is to demonstrate the usage of software for

More information

Statistical matching: conditional. independence assumption and auxiliary information

Statistical matching: conditional. independence assumption and auxiliary information Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional

More information