Statistical Matching using Fractional Imputation

1 Statistical Matching using Fractional Imputation
Jae-Kwang Kim, Iowa State University
Joint work with Emily Berg and Taesung Park

2 Outline
1. Introduction
2. Classical Approaches
3. Proposed method
4. Application: Measurement error models
5. Simulation Study
6. Conclusion
Kim (ISU) Matching 2 / 35

3 Introduction: Motivation
Combine information from several surveys.
Example: two surveys
1. Survey A: observe $X$ and $Y_1$
2. Survey B: observe $X$ and $Y_2$
We want to create a data file with $X$, $Y_1$, and $Y_2$.
If the Survey B sample is a subset of the Survey A sample, then a record linkage technique can be used to obtain the $Y_1$ value for the Survey B sample.
What if the two samples are independent?

4 Introduction
Table: A simple data structure for matching
            $X$    $Y_1$    $Y_2$
Sample A     o       o
Sample B     o                o

5 Introduction
Table: Data after statistical matching
            $X$    $Y_1$    $Y_2$
Sample A     o       o        o
Sample B     o       o        o
Also called data fusion or data combination.

6 Introduction: Example 1
Split questionnaire design
Split the original sample into two groups.
In group 1, ask $(x, y_1)$.
In group 2, ask $(x, y_2)$.
Often used to reduce the response burden (and improve the quality of the survey responses).

7 Introduction: Example 2
Combining two surveys
Survey A: health-related survey
Survey B: socio-economic survey
$x$: demographic variable, $y_1$: health status variable, $y_2$: socio-economic variable
We are interested in fitting a regression of $y_1$ (e.g. obesity) on $x$ and $y_2$ using the two surveys.
The two samples should be obtained from the same finite population.

9 Introduction: Idea
We want to create $Y_1$ for each element in sample B by finding a statistical twin in sample A.
This is often based on the assumption that $Y_1$ and $Y_2$ are conditionally independent given $X$. That is,
$$Y_1 \perp Y_2 \mid X.$$
Under the CI (conditional independence) assumption, we have
$$f(y_1 \mid x, y_2) = f(y_1 \mid x),$$
and the statistical twin is determined solely by how close the two elements are in terms of $x$.

10 Introduction: Remark
Under the assumption that $(X, Y_1, Y_2)$ is multivariate normal, the CI assumption means that
$$\sigma_{12} = \sigma_{1x}\sigma_{2x}/\sigma_{xx} \quad \text{and} \quad \rho_{12} = \rho_{1x}\rho_{2x}.$$
That is, $\sigma_{12}$ is determined by the other parameters rather than estimated from the realized samples.
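As a small numerical illustration of the remark above, the sketch below plugs hypothetical variances and covariances (the numbers are illustrative assumptions, not values from the talk) into the CI identity and checks that the implied partial covariance of $Y_1$ and $Y_2$ given $X$ is zero and that the correlation form of the identity holds.

```python
import math

def ci_implied_cov(s1x, s2x, sxx):
    """Covariance of (Y1, Y2) implied by conditional independence given X."""
    return s1x * s2x / sxx

# Hypothetical second moments (illustrative values only).
sxx, s11, s22 = 2.0, 1.5, 3.0   # Var(X), Var(Y1), Var(Y2)
s1x, s2x = 0.9, 1.2             # Cov(Y1, X), Cov(Y2, X)

s12 = ci_implied_cov(s1x, s2x, sxx)

# Under normality, Cov(Y1, Y2 | X) = s12 - s1x * s2x / sxx, which CI forces to zero.
partial_cov = s12 - s1x * s2x / sxx

# The same statement on the correlation scale: rho_12 = rho_1x * rho_2x.
rho_1x = s1x / math.sqrt(s11 * sxx)
rho_2x = s2x / math.sqrt(s22 * sxx)
rho_12 = s12 / math.sqrt(s11 * s22)

print(s12, partial_cov, rho_12, rho_1x * rho_2x)
```

No observation of $(Y_1, Y_2)$ jointly enters the calculation, which is the point of the remark: under CI the cross-covariance is fixed by the two marginal relationships with $X$.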

11 Existing Methods: Methods under the CI assumption
Synthetic data imputation:
1. Estimate $f(y_1 \mid x)$ from sample A, denoted by $\hat f_a(y_1 \mid x)$.
2. For each element in sample B, use the $x_i$ value to create imputed value(s) from $\hat f_a(y_1 \mid x)$.
Matching: two-step method
Instead of using the synthetic values directly for imputation, the synthetic values are used to identify the statistical twins in sample A.
The value of the identified twin in sample A is used as the imputed value.

12 Existing Methods: Some popular methods under the CI assumption
Parametric approach: often based on a parametric model or regression model,
$$\hat y_{1i} = \hat\beta_0 + \hat\beta_1 x_i.$$
Nonparametric approach:
Random hot deck
Rank hot deck
Distance hot deck
Reference: D'Orazio, Di Zio, and Scanu (2006). Statistical Matching: Theory and Practice, Wiley.
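The two CI-based strategies described above can be sketched on toy data. This is only an illustration under stated assumptions: the data are simulated, the working model for $f(y_1 \mid x)$ is a normal linear regression, and the donor rule in the matching step (closest observed $y_1$ to the synthetic value) is one possible distance hot deck choice rather than a rule taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: sample A observes (x, y1); sample B observes x
# (its y2 plays no role in imputation under the CI assumption).
n_a, n_b = 200, 5
x_a = rng.normal(size=n_a)
y1_a = 1.0 + 2.0 * x_a + rng.normal(scale=0.5, size=n_a)
x_b = rng.normal(size=n_b)

# Step 1: estimate f(y1 | x) from sample A with a normal linear regression.
design_a = np.column_stack([np.ones(n_a), x_a])
beta_hat, *_ = np.linalg.lstsq(design_a, y1_a, rcond=None)
resid_sd = np.std(y1_a - design_a @ beta_hat, ddof=2)

# Step 2 (synthetic data imputation): draw y1 for sample B from the fitted model.
y1_synthetic = (beta_hat[0] + beta_hat[1] * x_b
                + rng.normal(scale=resid_sd, size=n_b))

# Two-step matching: use each synthetic value only to locate a donor in
# sample A (here the unit whose observed y1 is closest), then impute the
# donor's actually observed y1 instead of the synthetic value.
donor = np.abs(y1_a[None, :] - y1_synthetic[:, None]).argmin(axis=1)
y1_matched = y1_a[donor]

print(beta_hat, y1_synthetic, y1_matched)
```

The matched values are always real observations from sample A, which is the practical appeal of the two-step method over direct synthetic imputation.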

14 New Approach: Motivation
When $Y_1$ is imputed under the CI assumption, the regression of $Y_1$ on $X$ and $Y_2$ in the matched data will give an insignificant regression coefficient on $Y_2$. That is, the p-value for $\hat\beta_2$ will be large in
$$\hat y_1 = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 y_2.$$
The CI assumption is often unrealistic! For example,
1. $X$ is often a demographic variable,
2. $Y_1$ is a social-behavior (or public health) variable,
3. $Y_2$ is an economic variable (e.g. household income).
In this case, we may have $\mathrm{Corr}(Y_1, Y_2 \mid X) \neq 0$.

15 New Approach: Alternative interpretation
We can view the problem as an omitted variable regression problem:
$$y_1 = \beta_0^{(1)} + \beta_1^{(1)} x + \beta_2^{(1)} z + e_1$$
$$y_2 = \beta_0^{(2)} + \beta_1^{(2)} x + \beta_2^{(2)} z + e_2$$
where $z$, $e_1$, and $e_2$ are never observed, and $e_1$ and $e_2$ are independent.
$z$ is an unobservable confounding factor that explains $\mathrm{Cov}(y_1, y_2 \mid x) \neq 0$.
Thus, if we fit a regression of $(y_1, y_2)$ on $x$ alone, the error terms are still correlated.
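A quick simulation can make the omitted-variable view concrete. The sketch below uses hypothetical coefficients (all numeric values are assumptions chosen for illustration) and checks that the residuals from regressing $y_1$ and $y_2$ on $x$ alone remain correlated because of the shared unobserved $z$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

x = rng.normal(size=n)
z = rng.normal(size=n)    # unobserved confounder
e1 = rng.normal(size=n)   # independent error terms
e2 = rng.normal(size=n)

# Hypothetical coefficients, chosen only for illustration.
y1 = 1.0 + 0.5 * x + 1.0 * z + e1
y2 = -1.0 + 1.0 * x + 1.0 * z + e2

def residuals_on_x(y, x):
    """Residuals from the least-squares regression of y on (1, x)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

r1 = residuals_on_x(y1, x)
r2 = residuals_on_x(y2, x)
resid_corr = np.corrcoef(r1, r2)[0, 1]

print(resid_corr)
```

With these coefficients each residual is $z$ plus an independent unit-variance error, so the population residual correlation is $1/2$; the CI assumption would force it to be zero.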

16 New Approach: Instrumental variable
Under the CI assumption, imputed values are generated from $f(y_1 \mid x)$, which completely ignores the observed information in $y_2$.
Let's try to generate imputed values from $f(y_1 \mid x, y_2)$.
However, we cannot estimate the parameters in $f(y_1 \mid x, y_2)$, because $y_1$ and $y_2$ are never observed together.
Use an instrumental variable assumption to identify the model.

17 New Approach: Idea
Decompose $X = (X_1, X_2)$ such that
(i) $f(y_1 \mid x_1, x_2, y_2) = f(y_1 \mid x_1, y_2)$
(ii) $f(y_1 \mid x_1, x_2 = a) \neq f(y_1 \mid x_1, x_2 = b)$ for some $a \neq b$.
$X_2$ is often called an instrumental variable (IV) for $Y_2$.

18 New Approach: Proposed method
Under the IV assumption,
$$f(y_1 \mid x, y_2) \propto f(y_1 \mid x)\, f(y_2 \mid x_1, y_1).$$
The second term can be ignored under the CI assumption.
The second term incorporates the observed information of $y_2$ in sample B.
The EM algorithm can be used to perform parameter estimation and prediction simultaneously.
The E-step can be computationally heavy (Markov chain Monte Carlo).
Metropolis-Hastings algorithm:
1. Generate a candidate $y_1$ from $\hat f_a(y_1 \mid x)$.
2. Accept the candidate if $f(y_2 \mid x_1, y_1; \hat\theta)$ is large at the current parameter value $\hat\theta$.
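The Metropolis-Hastings step above can be sketched as an independence sampler for a single unit in sample B. When the proposal is $\hat f_a(y_1 \mid x)$ itself, the acceptance ratio reduces to a ratio of $f(y_2 \mid x_1, y_1; \hat\theta)$ values, which is one way to read "accept if large". The normal working models, the function names, and all numeric values below are assumptions made for illustration, not part of the slides.

```python
import math
import random

def log_f2(y2, x1, y1, theta):
    """Log of a normal working model f(y2 | x1, y1; theta), up to a constant."""
    t0, t1, t2, sd2 = theta
    resid = y2 - (t0 + t1 * x1 + t2 * y1)
    return -0.5 * (resid / sd2) ** 2

def mh_draws(x1, y2, mean1, sd1, theta, n_iter, rng):
    """Independence sampler targeting f(y1 | x) f(y2 | x1, y1; theta).

    mean1 and sd1 describe the fitted normal f_a(y1 | x) for this unit.
    """
    current = rng.gauss(mean1, sd1)
    draws = []
    for _ in range(n_iter):
        proposal = rng.gauss(mean1, sd1)   # step 1: candidate from f_a(y1 | x)
        # Step 2: the proposal density cancels, leaving the f(y2 | .) ratio.
        log_ratio = (log_f2(y2, x1, proposal, theta)
                     - log_f2(y2, x1, current, theta))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        draws.append(current)
    return draws

# Hypothetical unit: f_a(y1 | x) = N(0, 1), f(y2 | x1, y1) = N(y1, 1), y2 = 2.
rng = random.Random(0)
chain = mh_draws(x1=0.0, y2=2.0, mean1=0.0, sd1=1.0,
                 theta=(0.0, 0.0, 1.0, 1.0), n_iter=20000, rng=rng)
kept = chain[1000:]                        # discard burn-in
mc_mean = sum(kept) / len(kept)
print(mc_mean)
```

For this toy configuration the target is normal with mean 1 and variance 0.5, so the chain average should settle near 1. Repeating such a chain for every unit at every EM iteration is what makes the E-step expensive.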

19 New Approach: Proposed method
Parametric fractional imputation (PFI) of Kim (2011) is an alternative computational tool that does not involve MCMC computation but still implements the EM algorithm with an intractable E-step.
PFI uses importance sampling: when the target distribution is
$$f(y_1 \mid x, y_2) \propto f(y_1 \mid x)\, f(y_2 \mid x_1, y_1),$$
first generate $m$ values of $y_1 \sim f(y_1 \mid x)$ and then use a normalized version of $f(y_2 \mid x_1, y_1)$ as the weight assigned to each generated $y_1$.
Solve the weighted score equation to update the parameters in the M-step.

20 New Approach: Proposed method: Parametric fractional imputation
1. For each $i \in B$, generate $m$ imputed values of $y_1$, denoted by $y_{1i}^{(1)}, \ldots, y_{1i}^{(m)}$, from $\hat f_a(y_1 \mid x_i)$.
2. Let $\hat\theta_t$ be the current parameter value of $\theta$ in $f(y_2 \mid x_1, y_1)$. For the $j$-th imputed value $y_{1i}^{(j)}$, assign the fractional weight
$$w_{ij} \propto f(y_{2i} \mid x_{1i}, y_{1i}^{(j)}; \hat\theta_t),$$
where $\sum_{j=1}^{m} w_{ij} = 1$.
3. Solve the fractionally imputed score equation for $\theta$,
$$\sum_{i \in B} w_{ib} \sum_{j=1}^{m} w_{ij}\, S(\theta; x_{1i}, y_{1i}^{(j)}, y_{2i}) = 0,$$
to update $\hat\theta_{t+1}$, where $S(\theta; x_1, y_1, y_2) = \partial \log f(y_2 \mid x_1, y_1; \theta)/\partial\theta$.
4. Go to step 2 and continue until convergence.
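The four steps above can be sketched for one concrete choice of models. Everything specific here is an assumption for illustration: both $f(y_1 \mid x)$ and $f(y_2 \mid x_1, y_1; \theta)$ are taken to be normal linear models, the sampling weights $w_{ib}$ are set to 1, the data are simulated, and the function names `fit_f_a` and `pfi_em` are hypothetical. With a normal model the weighted score equation in step 3 is a weighted least-squares problem.

```python
import numpy as np

def fit_f_a(x_a, y1_a):
    """Fit a normal linear working model for f_a(y1 | x) on sample A."""
    design = np.column_stack([np.ones(len(y1_a)), x_a])
    coef, *_ = np.linalg.lstsq(design, y1_a, rcond=None)
    sd = np.std(y1_a - design @ coef, ddof=design.shape[1])
    return coef, sd

def pfi_em(x_b, y2_b, coef_a, sd_a, m=100, n_iter=200, tol=1e-8, seed=0):
    """PFI for theta in y2 | x1, y1 ~ N(t0 + t1*x1 + t2*y1, sigma2)."""
    rng = np.random.default_rng(seed)
    n = len(y2_b)
    # Step 1: m imputed y1 values per unit of B, drawn once from f_a(y1 | x_i).
    mean_b = np.column_stack([np.ones(n), x_b]) @ coef_a
    y1_imp = mean_b[:, None] + sd_a * rng.normal(size=(n, m))
    # Long-format design with rows (1, x1_i, y1_i^(j)).
    design = np.column_stack([np.ones(n * m), np.repeat(x_b[:, 0], m),
                              y1_imp.ravel()])
    y2_long = np.repeat(y2_b, m)
    theta, sigma2 = np.array([y2_b.mean(), 0.0, 0.0]), y2_b.var()
    for _ in range(n_iter):
        # Step 2: fractional weights proportional to f(y2_i | x1_i, y1_i^(j)).
        resid = (y2_long - design @ theta).reshape(n, m)
        log_w = -0.5 * resid ** 2 / sigma2
        log_w -= log_w.max(axis=1, keepdims=True)   # numerical stability
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        # Step 3: weighted score equation = weighted least squares (w_ib = 1).
        w_long = w.ravel()
        xtw = design.T * w_long
        theta_new = np.linalg.solve(xtw @ design, xtw @ y2_long)
        sigma2 = np.sum(w_long * (y2_long - design @ theta_new) ** 2) / n
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:                               # step 4
            break
    return theta, sigma2, y1_imp, w

# Hypothetical simulated data: x = (x1, x2), with x2 acting as the instrument.
rng = np.random.default_rng(42)
n = 1000
x_a, x_b = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
y1_a = 1.0 + x_a[:, 0] + x_a[:, 1] + rng.normal(size=n)
y1_b = 1.0 + x_b[:, 0] + x_b[:, 1] + rng.normal(size=n)   # never observed in B
y2_b = 0.5 + 0.5 * x_b[:, 0] + 1.0 * y1_b + rng.normal(size=n)

coef_a, sd_a = fit_f_a(x_a, y1_a)
theta_hat, sigma2_hat, y1_imp, frac_w = pfi_em(x_b, y2_b, coef_a, sd_a)
print(theta_hat, sigma2_hat)
```

The imputed values are generated once and only the weights change across iterations, which is the design choice that removes MCMC from the E-step. In this toy setup the coefficient on $y_1$ (third element of `theta_hat`) should land near its simulated value of 1.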

21 Remark
Fractional imputation can be understood as a tool for computing a Monte Carlo approximation of the conditional expectation given the observation.
A fractionally imputed data file can be used to estimate many different parameters. That is, if a parameter $\eta$ is defined as a solution to $E\{U(\eta; x, y_1, y_2)\} = 0$, then a consistent estimator of $\eta$ can be obtained as the solution to
$$\sum_{i \in B} w_{ib} \sum_{j=1}^{m} w_{ij}\, U(\eta; x_i, y_{1i}^{(j)}, y_{2i}) = 0.$$
Note that the above estimating equation is a Monte Carlo approximation to the following estimating equation:
$$\sum_{i \in B} w_{ib}\, E\{U(\eta; x_i, Y_{1i}, y_{2i}) \mid x_i, y_{2i}\} = 0.$$
For variance estimation, a linearization method can be used (skipped here).
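As a minimal worked case of the estimating equation above, take $U(\eta; x, y_1, y_2) = y_1 - \eta$, so that $\eta$ is the mean of $Y_1$. Because the fractional weights sum to one within each unit, the solution is a doubly weighted average. The function name `fi_mean` and the toy numbers are assumptions for illustration, and $w_{ib}$ is read here as the weight attached to unit $i$ in sample B.

```python
def fi_mean(y1_imputed, frac_weights, unit_weights):
    """Solve sum_i w_ib sum_j w_ij (y1_ij - eta) = 0 for eta."""
    numerator = 0.0
    denominator = 0.0
    for y_row, w_row, w_b in zip(y1_imputed, frac_weights, unit_weights):
        # Inner sum: fractionally weighted mean of the imputed values of unit i.
        unit_mean = sum(w * y for w, y in zip(w_row, y_row))
        numerator += w_b * unit_mean
        denominator += w_b
    return numerator / denominator

# Toy fractionally imputed file: 2 units, m = 2 imputed values each.
y1_imputed = [[1.0, 3.0], [2.0, 4.0]]
frac_weights = [[0.5, 0.5], [0.25, 0.75]]   # each row sums to 1
unit_weights = [1.0, 1.0]

eta_hat = fi_mean(y1_imputed, frac_weights, unit_weights)
print(eta_hat)  # 2.75
```

The unit-level weighted means are 2.0 and 3.5, so the estimate is 2.75. Other parameters only require swapping in a different $U$; the imputed values and weights stay the same.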

23 Application to measurement error models
We are interested in estimating $\theta$ in $f(y \mid x; \theta)$.
Instead of observing $x$, we observe $z$, which can be highly correlated with $x$.
Thus, $z$ is an instrumental variable for $x$:
$$f(y \mid x, z) = f(y \mid x)$$
and
$$f(y \mid z = a) \neq f(y \mid z = b)$$
for $a \neq b$.
In addition to the original sample, we have a separate calibration sample that observes $(x_i, z_i)$.

24 Example: Measurement error model
Table: External calibration study
            $Z$    $X$    $Y$
Sample A     o      o
Sample B     o             o
Table: Internal calibration study
Sample                      $Z$    $X$    $Y$
Validation subsample         o      o      o
Non-validation subsample     o             o

25 Remark
Internal calibration study: two-phase sampling structure
Phase one: observe $(z, y)$.
Phase two: validation subsample, observe $x$ in addition to $(z, y)$.
Imputation approach for two-phase sampling:
Estimate $f(x \mid z, y)$ from the second-phase sample.
For the elements in the phase-one sample, generate $x \sim \hat f(x \mid z, y)$.
For the external calibration study, we use the proposed statistical matching technique under the assumption that $f(y \mid x, z) = f(y \mid x)$.

26 Proposed method: Idea
In sample B, $x$ is a latent variable (a variable that is always missing).
The goal is to generate $x$ in sample B from
$$f(x_i \mid z_i, y_i) \propto f(x_i \mid z_i)\, f(y_i \mid x_i, z_i) = f(x_i \mid z_i)\, f(y_i \mid x_i).$$
Obtain a consistent estimator $\hat f_a(x \mid z)$ from sample A.
We may use a Monte Carlo EM algorithm:
E-step: generate $x_i^{(1)}, \ldots, x_i^{(m)}$ from
$$f(x_i \mid z_i, y_i; \hat\theta^{(t)}) \propto \hat f_a(x_i \mid z_i)\, f(y_i \mid x_i; \hat\theta^{(t)}).$$
M-step: solve the imputed score equation for $\theta$.

27 Fractional imputation for the EM algorithm
The above E-step may be computationally challenging (it often relies on an MCMC method).
Parametric fractional imputation can be used for easy computation.
E-step:
1. Generate $x_i^{(1)}, \ldots, x_i^{(m)}$ from $\hat f_a(x_i \mid z_i)$ for $i \in B$.
2. Compute the fractional weights associated with $x_i^{(j)}$ by
$$w_{ij} \propto f(y_i \mid x_i^{(j)}; \hat\theta^{(t)})$$
and $\sum_j w_{ij} = 1$.
M-step: solve the weighted score equation for $\theta$.
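The E-step and M-step above can be sketched for a logistic outcome model. This is a simplified illustration, not the configuration used in the talk: $\hat f_a(x \mid z)$ is assumed to be a homoscedastic normal linear regression of $x$ on $z$ fitted in sample A, the data are simulated with constant measurement error variance, unit weights are 1, and the helper names `weighted_logit` and `pfi_logit` are hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def weighted_logit(x, y, w, gamma, n_newton=25):
    """Newton-Raphson for the weighted logistic score equation."""
    design = np.column_stack([np.ones_like(x), x])
    for _ in range(n_newton):
        p = expit(design @ gamma)
        score = design.T @ (w * (y - p))
        info = (design.T * (w * p * (1.0 - p))) @ design
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return gamma

def pfi_logit(z_b, y_b, c0, c1, sd_xz, m=50, n_iter=100, tol=1e-6, seed=0):
    """PFI for (gamma_0, gamma_x) in logit P(y = 1 | x) = gamma_0 + gamma_x * x."""
    rng = np.random.default_rng(seed)
    n = len(y_b)
    # E-step, part 1: draw x_i^(1), ..., x_i^(m) once from the fitted f_a(x | z).
    x_imp = (c0 + c1 * z_b)[:, None] + sd_xz * rng.normal(size=(n, m))
    y_rep = np.repeat(y_b, m).reshape(n, m)
    gamma = np.zeros(2)
    for _ in range(n_iter):
        # E-step, part 2: weights proportional to f(y_i | x_i^(j); gamma).
        p = expit(gamma[0] + gamma[1] * x_imp)
        lik = np.where(y_rep == 1, p, 1.0 - p)
        w = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted score equation for the logistic model.
        gamma_new = weighted_logit(x_imp.ravel(), y_rep.ravel(), w.ravel(), gamma)
        converged = np.max(np.abs(gamma_new - gamma)) < tol
        gamma = gamma_new
        if converged:
            break
    return gamma, x_imp, w

# Hypothetical data: sample A has (x, z); sample B has (z, y) only.
rng = np.random.default_rng(7)
n_a = n_b = 1000
x_a = rng.normal(size=n_a)
z_a = x_a + rng.normal(scale=0.5, size=n_a)
x_b = rng.normal(size=n_b)                       # latent in sample B
z_b = x_b + rng.normal(scale=0.5, size=n_b)
y_b = rng.binomial(1, expit(1.0 + x_b)).astype(float)

# Fit the working model x | z ~ N(c0 + c1 * z, sd_xz^2) on sample A.
design_a = np.column_stack([np.ones(n_a), z_a])
(c0, c1), *_ = np.linalg.lstsq(design_a, x_a, rcond=None)
sd_xz = np.std(x_a - (c0 + c1 * z_a), ddof=2)

gamma_hat, x_imp, frac_w = pfi_logit(z_b, y_b, c0, c1, sd_xz)
print(gamma_hat)
```

A naive logistic regression of $y$ on $z$ would attenuate the slope; here the slope estimate in `gamma_hat` should be close to the simulated value of 1 because the weights pull each unit's imputed $x$ values toward those consistent with its observed $y$.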

29 Simulation setup
Measurement error model setup:
$$y_i \sim \mathrm{Bernoulli}(p_i), \qquad \mathrm{logit}(p_i) = \gamma_0 + \gamma_x x_i$$
$$z_i = \beta_0 + \beta_1 x_i + u_i, \qquad u_i \sim N(0, \sigma^2 |x_i|^{2\alpha}), \qquad x_i \sim N(\mu_x, \sigma_x^2)$$
We observe $(x_i, z_i)$, $i = 1, \ldots, n_A$, in sample A. In sample B, instead of observing $(x_i, y_i)$, we observe $(z_i, y_i)$.
For the simulation, $n_A = n_B = 800$, $\gamma_0 = 1$, $\gamma_x = 1$, $\beta_0 = 0$, $\beta_1 = 1$, $\sigma^2 = 0.25$, $\alpha = 0.4$, $\mu_x = 0$, and $\sigma_x^2 = 1$.
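One replicate of this setup could be generated as in the sketch below. The parameter defaults are the values listed on the slide; the function name `simulate_samples`, the seed, and the use of $|x_i|$ in the error variance (so that the variance is defined for negative $x_i$) are assumptions.

```python
import numpy as np

def simulate_samples(n_a=800, n_b=800, gamma0=1.0, gamma_x=1.0, beta0=0.0,
                     beta1=1.0, sigma2=0.25, alpha=0.4, mu_x=0.0,
                     sigma2_x=1.0, seed=0):
    """Generate sample A = (x, z) and sample B = (z, y) for one replicate."""
    rng = np.random.default_rng(seed)

    def draw(n):
        x = rng.normal(mu_x, np.sqrt(sigma2_x), size=n)
        # Heteroscedastic error: Var(u_i) = sigma^2 * |x_i|^(2 * alpha)
        # (absolute value assumed, see the lead-in).
        u = rng.normal(size=n) * np.sqrt(sigma2) * np.abs(x) ** alpha
        z = beta0 + beta1 * x + u
        p = 1.0 / (1.0 + np.exp(-(gamma0 + gamma_x * x)))
        y = rng.binomial(1, p)
        return x, z, y

    x_a, z_a, _ = draw(n_a)   # sample A keeps (x, z); y is not observed
    _, z_b, y_b = draw(n_b)   # sample B keeps (z, y); x is latent
    return (x_a, z_a), (z_b, y_b)

sample_a, sample_b = simulate_samples()
print(sample_a[0][:3], sample_a[1][:3], sample_b[0][:3], sample_b[1][:3])
```

Repeating the call with different seeds gives independent Monte Carlo replicates on which competing estimators of $\gamma_x$ can be compared.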

30 Methods
1. Parametric fractional imputation (PFI)
2. Hot deck fractional imputation (HDFI)
3. Naive: naive estimator obtained from the logistic regression of $y_i$ on $z_i$ for $i \in B$.
4. Bayes: proposed by Guo and Little (2011). Gibbs sampling is implemented with JAGS. We used 1000 iterations of a single chain for inference, after discarding the first 500 for burn-in. We specify diffuse proper prior distributions for the Bayes estimators. Letting $\theta_1 = (\log(\sigma_x^2), \log(\sigma^2), \mu_x, \beta_0, \beta_1, \gamma_0, \gamma_x)$, we assume a priori that $\theta_1 \sim N(0, 10^6 I_7)$, where $I_7$ is the $7 \times 7$ identity matrix. The prior distribution for the power $\alpha$ is uniform on the interval $[-5, 5]$.
5. Weighted regression calibration (WRC): regression calibration method incorporating the unequal variance in the measurement error model (also considered in Guo and Little, 2011).

31 Simulation result
Table: Monte Carlo (MC) means, variances, and mean squared errors (MSE) of point estimators of $\gamma_x$
Method    MC Bias    MC Variance    MC MSE
PFI
HDFI
Naive
Bayes
WRC

33 Concluding remark
Statistical matching is a tool for survey data integration.
The current practice of statistical matching is based on the conditional independence assumption, which may not be realistic in practice.
A new approach based on an instrumental variable is proposed.
The proposed method provides statistically valid regression coefficients for the matched data even when the CI assumption does not hold.
Variance estimation is possible (not covered here).
The method is directly applicable to measurement error model problems and split questionnaire design problems.

34 Future research
Semi-parametric inference by making $\hat f_a(y_1 \mid x)$ nonparametric in
$$f(y_1 \mid x, y_2) \propto \hat f_a(y_1 \mid x)\, f(y_2 \mid x_1, y_1).$$
Application to causal inference: estimation of the average treatment effect from observational studies when we cannot observe the counterfactual outcomes.
Combination of two data sources: one from a probability sample and the other from a non-probability sample.

35 The end


Modeling Criminal Careers as Departures From a Unimodal Population Age-Crime Curve: The Case of Marijuana Use Modeling Criminal Careers as Departures From a Unimodal Population Curve: The Case of Marijuana Use Donatello Telesca, Elena A. Erosheva, Derek A. Kreader, & Ross Matsueda April 15, 2014 extends Telesca

More information

Types of missingness and common strategies

Types of missingness and common strategies 9 th UK Stata Users Meeting 20 May 2003 Multiple imputation for missing data in life course studies Bianca De Stavola and Valerie McCormack (London School of Hygiene and Tropical Medicine) Motivating example

More information

Lecture 26: Missing data

Lecture 26: Missing data Lecture 26: Missing data Reading: ESL 9.6 STATS 202: Data mining and analysis December 1, 2017 1 / 10 Missing data is everywhere Survey data: nonresponse. 2 / 10 Missing data is everywhere Survey data:

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction A Monte Carlo method is a compuational method that uses random numbers to compute (estimate) some quantity of interest. Very often the quantity we want to compute is the mean of

More information

Markov Chain Monte Carlo (part 1)

Markov Chain Monte Carlo (part 1) Markov Chain Monte Carlo (part 1) Edps 590BAY Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Spring 2018 Depending on the book that you select for

More information

Lecture 13: Model selection and regularization

Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

BeviMed Guide. Daniel Greene

BeviMed Guide. Daniel Greene BeviMed Guide Daniel Greene 1 Introduction BeviMed [1] is a procedure for evaluating the evidence of association between allele configurations across rare variants, typically within a genomic locus, and

More information

Multiple Imputation with Mplus

Multiple Imputation with Mplus Multiple Imputation with Mplus Tihomir Asparouhov and Bengt Muthén Version 2 September 29, 2010 1 1 Introduction Conducting multiple imputation (MI) can sometimes be quite intricate. In this note we provide

More information

Overview. Monte Carlo Methods. Statistics & Bayesian Inference Lecture 3. Situation At End Of Last Week

Overview. Monte Carlo Methods. Statistics & Bayesian Inference Lecture 3. Situation At End Of Last Week Statistics & Bayesian Inference Lecture 3 Joe Zuntz Overview Overview & Motivation Metropolis Hastings Monte Carlo Methods Importance sampling Direct sampling Gibbs sampling Monte-Carlo Markov Chains Emcee

More information

Quantitative Biology II!

Quantitative Biology II! Quantitative Biology II! Lecture 3: Markov Chain Monte Carlo! March 9, 2015! 2! Plan for Today!! Introduction to Sampling!! Introduction to MCMC!! Metropolis Algorithm!! Metropolis-Hastings Algorithm!!

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU.

SOS3003 Applied data analysis for social science Lecture note Erling Berge Department of sociology and political science NTNU. SOS3003 Applied data analysis for social science Lecture note 04-2009 Erling Berge Department of sociology and political science NTNU Erling Berge 2009 1 Missing data Literature Allison, Paul D 2002 Missing

More information

HANDLING MISSING DATA

HANDLING MISSING DATA GSO international workshop Mathematic, biostatistics and epidemiology of cancer Modeling and simulation of clinical trials Gregory GUERNEC 1, Valerie GARES 1,2 1 UMR1027 INSERM UNIVERSITY OF TOULOUSE III

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999. 2 Schafer, J. L., Graham, J. W.: (2002). Missing Data: Our View of the State of the Art. Psychological methods, 2002, Vol 7, No 2, 47 77 Rosner, B. (2005) Fundamentals of Biostatistics, 6th ed, Wiley.

More information

DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R

DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R Lee, Rönnegård & Noh LRN@du.se Lee, Rönnegård & Noh HGLM book 1 / 24 Overview 1 Background to the book 2 Crack growth example 3 Contents

More information

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data

Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data Variance Estimation in Presence of Imputation: an Application to an Istat Survey Data Marco Di Zio, Stefano Falorsi, Ugo Guarnera, Orietta Luzi, Paolo Righi 1 Introduction Imputation is the commonly used

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Introduction to Mplus

Introduction to Mplus Introduction to Mplus May 12, 2010 SPONSORED BY: Research Data Centre Population and Life Course Studies PLCS Interdisciplinary Development Initiative Piotr Wilk piotr.wilk@schulich.uwo.ca OVERVIEW Mplus

More information

Rolling Markov Chain Monte Carlo

Rolling Markov Chain Monte Carlo Rolling Markov Chain Monte Carlo Din-Houn Lau Imperial College London Joint work with Axel Gandy 4 th July 2013 Predict final ranks of the each team. Updates quick update of predictions. Accuracy control

More information

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM Jayawant Mandrekar, Daniel J. Sargent, Paul J. Novotny, Jeff A. Sloan Mayo Clinic, Rochester, MN 55905 ABSTRACT A general

More information

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Performance Estimation and Regularization Kasthuri Kannan, PhD. Machine Learning, Spring 2018 Bias- Variance Tradeoff Fundamental to machine learning approaches Bias- Variance Tradeoff Error due to Bias:

More information

Week 10: Heteroskedasticity II

Week 10: Heteroskedasticity II Week 10: Heteroskedasticity II Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Dealing with heteroskedasticy

More information

Missing data analysis. University College London, 2015

Missing data analysis. University College London, 2015 Missing data analysis University College London, 2015 Contents 1. Introduction 2. Missing-data mechanisms 3. Missing-data methods that discard data 4. Simple approaches that retain all the data 5. RIBG

More information

Approximate Bayesian Computation. Alireza Shafaei - April 2016

Approximate Bayesian Computation. Alireza Shafaei - April 2016 Approximate Bayesian Computation Alireza Shafaei - April 2016 The Problem Given a dataset, we are interested in. The Problem Given a dataset, we are interested in. The Problem Given a dataset, we are interested

More information

Bootstrapping Method for 14 June 2016 R. Russell Rhinehart. Bootstrapping

Bootstrapping Method for  14 June 2016 R. Russell Rhinehart. Bootstrapping Bootstrapping Method for www.r3eda.com 14 June 2016 R. Russell Rhinehart Bootstrapping This is extracted from the book, Nonlinear Regression Modeling for Engineering Applications: Modeling, Model Validation,

More information

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points] CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.

More information

Scalable Bayes Clustering for Outlier Detection Under Informative Sampling

Scalable Bayes Clustering for Outlier Detection Under Informative Sampling Scalable Bayes Clustering for Outlier Detection Under Informative Sampling Based on JMLR paper of T. D. Savitsky Terrance D. Savitsky Office of Survey Methods Research FCSM - 2018 March 7-9, 2018 1 / 21

More information

Latent variable transformation using monotonic B-splines in PLS Path Modeling

Latent variable transformation using monotonic B-splines in PLS Path Modeling Latent variable transformation using monotonic B-splines in PLS Path Modeling E. Jakobowicz CEDRIC, Conservatoire National des Arts et Métiers, 9 rue Saint Martin, 754 Paris Cedex 3, France EDF R&D, avenue

More information

Rolling Markov Chain Monte Carlo

Rolling Markov Chain Monte Carlo Rolling Markov Chain Monte Carlo Din-Houn Lau Imperial College London Joint work with Axel Gandy 4 th September 2013 RSS Conference 2013: Newcastle Output predicted final ranks of the each team. Updates

More information

Bayesian Statistics Group 8th March Slice samplers. (A very brief introduction) The basic idea

Bayesian Statistics Group 8th March Slice samplers. (A very brief introduction) The basic idea Bayesian Statistics Group 8th March 2000 Slice samplers (A very brief introduction) The basic idea lacements To sample from a distribution, simply sample uniformly from the region under the density function

More information

Variability in Annual Temperature Profiles

Variability in Annual Temperature Profiles Variability in Annual Temperature Profiles A Multivariate Spatial Analysis of Regional Climate Model Output Tamara Greasby, Stephan Sain Institute for Mathematics Applied to Geosciences, National Center

More information

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background An Introduction to Multiple Imputation and its Application Craig K. Enders University of California - Los Angeles Department of Psychology cenders@psych.ucla.edu Background Work supported by Institute

More information

Hierarchical Bayesian Modeling with Ensemble MCMC. Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 12, 2014

Hierarchical Bayesian Modeling with Ensemble MCMC. Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 12, 2014 Hierarchical Bayesian Modeling with Ensemble MCMC Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 12, 2014 Simple Markov Chain Monte Carlo Initialise chain with θ 0 (initial

More information

DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R

DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R Lee, Rönnegård & Noh LRN@du.se Lee, Rönnegård & Noh HGLM book 1 / 25 Overview 1 Background to the book 2 A motivating example from my own

More information

Exam Issued: May 29, 2017, 13:00 Hand in: May 29, 2017, 16:00

Exam Issued: May 29, 2017, 13:00 Hand in: May 29, 2017, 16:00 P. Hadjidoukas, C. Papadimitriou ETH Zentrum, CTL E 13 CH-8092 Zürich High Performance Computing for Science and Engineering II Exam Issued: May 29, 2017, 13:00 Hand in: May 29, 2017, 16:00 Spring semester

More information

Bayes Estimators & Ridge Regression

Bayes Estimators & Ridge Regression Bayes Estimators & Ridge Regression Readings ISLR 6 STA 521 Duke University Merlise Clyde October 27, 2017 Model Assume that we have centered (as before) and rescaled X o (original X) so that X j = X o

More information