Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

ABSTRACT

Missing data is something that we cannot always prevent. Data can be missing because subjects refuse to answer a sensitive question or fear embarrassment. Researchers often assume that their data are missing completely at random or missing at random. Unfortunately, we cannot test whether the assumed mechanism holds because the missing values themselves are unobserved. Alternatively, we can run a simulation in SAS to observe the behavior of missing data under different assumptions: missing completely at random, missing at random, and not missing at random (non-ignorable). We compare the effects of imputation methods after setting some values of the variable of interest to missing. The aim is to see how substituted values in a data set affect subsequent analysis. This lets the audience decide which method(s) would be best for a data set with missing data.

MISSINGNESS MECHANISMS

When a response is missing,
- and the missingness does not depend on any variable observed in the dataset, including the variable of interest, it is said to be Missing Completely at Random (MCAR).
- and the missingness does not depend on the variable of interest but conditionally depends on some of the other variables observed in the dataset, it is said to be Missing at Random (MAR).
- and the missingness depends on the variable of interest, it is said to be Not Missing at Random (NMAR).

SIMULATION STUDY

For the simulation, the MathAchieve dataset is used, with the mathematics achievement score as the response variable. The following steps summarize how to simulate missing data in a SAS macro:

1. Assign some respondents' answers to missing:
   - MCAR is induced by taking a simple random sample of the observations and setting the variable of interest to missing for them.
   - MAR is induced by taking a stratified sample of the observations and setting the variable of interest to missing for them.
   - NMAR uses a condition to set the variable of interest to missing.
2. Use an imputation method to impute the missing values.
3. Calculate the Mean Square Error (MSE).

METHODS

The main objective of imputation is to fill in missing values using variables with non-missing values prior to data analysis.

- Listwise Deletion: A method that simply removes any observation that contains at least one missing value. Note that this method does not require imputation. This can be done in a DATA step.
- Mean Imputation: A method that replaces each missing value with the average of the observed values of the variable of interest. In SAS, this can be done by using PROC MEANS or PROC SUMMARY.
- Regression Imputation: A method that estimates a regression model and predicts missing values given a number of independent variables. In SAS, this can be done by using PROC REG.
- Hot Deck Imputation: A method for dealing with item nonresponse where every missing value is replaced with a value from a similar donor. In SAS, this can be done by using PROC SURVEYIMPUTE.
- Multiple Imputation: A method that repeats a stochastic imputation m times, where each imputation represents a random sample. In SAS, use the following procedures in order: PROC MI, PROC REG, and PROC MIANALYZE.

SIMULATION RESULTS
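As a rough illustration of how results of this kind can be produced, the sketch below induces MCAR missingness, applies mean imputation and regression imputation, and computes the MSE over the imputed cells. It is written in Python rather than SAS, uses synthetic data instead of MathAchieve, and every function name is hypothetical. The MSE definition used here (mean squared difference between imputed and true values at the missing positions) is an assumption, since the poster does not state its formula.

```python
import random
import statistics


def make_data(n=500, seed=1):
    """Synthetic stand-in data: covariate x and response y (not MathAchieve)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [12 + 3 * x + rng.gauss(0, 2) for x in xs]
    return xs, ys


def mcar_indices(n, percent, seed=2):
    """Pick percent% of the n positions by simple random sampling (MCAR)."""
    rng = random.Random(seed)
    return set(rng.sample(range(n), round(n * percent / 100)))


def mean_impute(ys, missing):
    """Replace each missing y with the mean of the observed y values."""
    fill = statistics.fmean([y for i, y in enumerate(ys) if i not in missing])
    return [fill if i in missing else y for i, y in enumerate(ys)]


def regression_impute(xs, ys, missing):
    """Replace each missing y with its prediction from a simple OLS fit on x."""
    obs = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in missing]
    mx = statistics.fmean([x for x, _ in obs])
    my = statistics.fmean([y for _, y in obs])
    sxy = sum((x - mx) * (y - my) for x, y in obs)
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * x if i in missing else y
            for i, (x, y) in enumerate(zip(xs, ys))]


def mse(true_ys, imputed_ys, missing):
    """Assumed definition: mean squared gap between imputed and true values."""
    return statistics.fmean([(imputed_ys[i] - true_ys[i]) ** 2 for i in missing])


if __name__ == "__main__":
    xs, ys = make_data()
    for percent in (10, 20, 30):
        missing = mcar_indices(len(ys), percent)
        by_mean = mean_impute(ys, missing)
        by_reg = regression_impute(xs, ys, missing)
        # Columns: % missing, MSE (mean), MSE (regression), true SD, SD after mean imputation
        print(percent,
              round(mse(ys, by_mean, missing), 2),
              round(mse(ys, by_reg, missing), 2),
              round(statistics.stdev(ys), 2),
              round(statistics.stdev(by_mean), 2))
```

Because the true values are kept alongside the imputed ones, the same loop can also compare the standard deviation before and after imputation, which is one way to see the shrinkage in variability that single imputation causes.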

Comparing MSE in Series Plot
- As the percentage of missing values increases, the Mean Square Error (MSE) increases.
- When the data are MCAR, the MSE is the largest, whereas the MSE is the smallest when the data are MAR.
- In terms of MSE, MAR is preferable to MCAR and NMAR.

Plot of Mean and Standard Deviation by the Amount of Missingness
- MCAR: means are unbiased estimates and SDs are underestimated.
- MAR: means are slightly overestimated and SDs are underestimated.
- NMAR: means are overestimated and SDs are underestimated.

LESSONS LEARNED

- Single variable imputation decreases correlation and variability.
- Multiple Imputation accounts for variability through the stochastic error added in each imputation.
- When data are MCAR, Mean Imputation is appropriate.
- When data are MAR, Regression Imputation and Multiple Imputation are appropriate.
- When data are NMAR, advanced statistical models (e.g., Structural Equation Modeling or MCMC) are preferable.

CONCLUSIONS

It is difficult to determine whether a given imputation method is the optimal choice because of unknown circumstances and uncertainty in missing data. The simulation results imply that the more missingness there is in a dataset, the higher the MSE will be for all imputation methods.

ACKNOWLEDGEMENTS

Statistics Department at Cal Poly
Senior Project Advisor: Soma Roy

REFERENCES

Little, Roderick J. A., and Donald B. Rubin. Statistical Analysis with Missing Data. New York: Wiley, 1987. Print.
Scheffer, Judi. "Dealing with Missing Data." Research Letters in the Information and Mathematical Sciences (2002): 153-60. Web.
MISSING DATA MECHANISMS IN SAS MACRO

%macro MDprocedure(dat=, percent=, mechanism=, rv=, catevar=, grp1=, grp2=);
  %if &mechanism = "MCAR" %then %do;
    /* Use PROC SQL to get the number of samples from a specified dataset (dat) */
    /* Use PROC SURVEYSELECT to randomly select obs from the dataset */
  %end;
  %else %if &mechanism = "MAR" %then %do;
    /* Begin the DATA step and create two SAS data sets from a specified dataset (dat)
       to use the if-else statement based on categorical variable
       (catevar, grp1, and grp2) to output into one of two datasets */
    /* Use PROC SQL to get the number of samples from the selected dataset above */
    /* Use PROC SURVEYSELECT to randomly select observations from the dataset */
    /* Concatenate two datasets from the DATA step into one data set */
  %end;
  %else %do;
    /* Begin the DATA step using a specified data set (dat) and set a condition
       in the if-else statement to output into one of the two datasets */
    /* Use PROC SQL to get the number of samples using the specified dataset */
    /* Use PROC SURVEYSELECT to randomly select observations from the dataset */
    /* Concatenate two datasets from the DATA step into one data set */
  %end;
%mend;
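
The macro outline above leaves the procedure calls as comments. As a language-neutral illustration only, the following Python sketch shows one possible reading of the three branches. It is not the author's SAS code. The column names `sector` and `mathach`, the `cutoff` parameter, and the way the sampling pool is chosen for MAR (rows in one category of an observed variable) and NMAR (rows whose response exceeds a cutoff) are all assumptions made for this example.

```python
import random


def make_rows(n=200, seed=3):
    """Synthetic rows; 'sector' and 'mathach' are hypothetical column names."""
    rng = random.Random(seed)
    return [{"sector": "Public" if i % 2 == 0 else "Catholic",
             "mathach": rng.gauss(13, 7)} for i in range(n)]


def induce_missing(rows, percent, mechanism, rv,
                   catevar=None, grp1=None, cutoff=None, seed=0):
    """Return copies of rows with about percent% of the rv values set to None."""
    rng = random.Random(seed)
    target = round(len(rows) * percent / 100)
    if mechanism == "MCAR":
        # Every row is equally likely to lose its value.
        pool = list(range(len(rows)))
    elif mechanism == "MAR":
        # Assumption: only rows in one observed category can lose their value.
        pool = [i for i, r in enumerate(rows) if r[catevar] == grp1]
    else:
        # Assumption (NMAR): only rows whose own response exceeds cutoff.
        pool = [i for i, r in enumerate(rows) if r[rv] > cutoff]
    chosen = set(rng.sample(pool, min(target, len(pool))))
    return [{**r, rv: None} if i in chosen else dict(r)
            for i, r in enumerate(rows)]


if __name__ == "__main__":
    rows = make_rows()
    for mech in ("MCAR", "MAR", "NMAR"):
        out = induce_missing(rows, 10, mech, "mathach",
                             catevar="sector", grp1="Public", cutoff=13)
        print(mech, sum(r["mathach"] is None for r in out))
```

Restricting the sampling pool is what distinguishes the branches: the number of missing values is the same in each case, but which rows are eligible depends on nothing (MCAR), on an observed covariate (MAR), or on the response itself (NMAR).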