Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Similar documents
Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Generalized Additive Model

Missing Data. Where did it go?

Handling Data with Three Types of Missing Values:

MISSING DATA AND MULTIPLE IMPUTATION

Analysis of Complex Survey Data with SAS

Tips on JMP ing into Mixture Experimentation

NEAREST NEIGHBOR HOT-DECK IMPUTATION FOR MISSING VALUES WITH SAS/IML

- 1 - Fig. A5.1 Missing value analysis dialog box

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

A Format to Make the _TYPE_ Field of PROC MEANS Easier to Interpret Matt Pettis, Thomson West, Eagan, MN

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

The G3GRID Procedure. Overview CHAPTER 30

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

Generalized Additive Models

A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis

Lecture 26: Missing data

Missing Data: What Are You Missing?

Extending the Scope of Custom Transformations

Loading Data. Introduction. Understanding the Volume Grid CHAPTER 2

Penetrating the Matrix Justin Z. Smith, William Gui Zupko II, U.S. Census Bureau, Suitland, MD

Machine Learning in the Wild. Dealing with Messy Data. Rajmonda S. Caceres. SDS 293 Smith College October 30, 2017

SAS/STAT 13.1 User s Guide. The NESTED Procedure

Fathom Dynamic Data TM Version 2 Specifications

Getting Correct Results from PROC REG

Missing Data? A Look at Two Imputation Methods Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA

Effects of PROC EXPAND Data Interpolation on Time Series Modeling When the Data are Volatile or Complex

Introduction to Statistical Analyses in SAS

Statistical matching: conditional. independence assumption and auxiliary information

Online Supplementary Appendix for. Dziak, Nahum-Shani and Collins (2012), Multilevel Factorial Experiments for Developing Behavioral Interventions:

Choosing the Right Procedure

Work Session on Statistical Data Editing (Paris, France, April 2014) Topic (v): International Collaboration and Software & Tools

Analysis of Imputation Methods for Missing Data. in AR(1) Longitudinal Dataset

SAS Structural Equation Modeling 1.3 for JMP

SAS/STAT 14.2 User s Guide. The SIMNORMAL Procedure

SD10 A SAS MACRO FOR PERFORMING BACKWARD SELECTION IN PROC SURVEYREG

Missing Data Analysis for the Employee Dataset

Hot-deck Imputation with SAS Arrays and Macros for Large Surveys

11.0 APPENDIX-B: COMPUTATION

CREATING THE ANALYSIS

Robust Linear Regression (Passing- Bablok Median-Slope)

Amelia multiple imputation in R

Multiple imputation using chained equations: Issues and guidance for practice

Missing Data Missing Data Methods in ML Multiple Imputation

Factorial ANOVA. Skipping... Page 1 of 18

Using PROC REPORT to Cross-Tabulate Multiple Response Items Patrick Thornton, SRI International, Menlo Park, CA

Enterprise Miner Software: Changes and Enhancements, Release 4.1

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS

Paper SDA-11. Logistic regression will be used for estimation of net error for the 2010 Census as outlined in Griffin (2005).

Intermediate SAS: Statistics

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

SAS/STAT 13.1 User s Guide. The SURVEYFREQ Procedure

Semi-Supervised Clustering with Partial Background Information

A Fast Multivariate Nearest Neighbour Imputation Algorithm

Statistics Lab #7 ANOVA Part 2 & ANCOVA

Contents of SAS Programming Techniques

An introduction to plotting data

Ranking Between the Lines

A Side of Hash for You To Dig Into

Missing Data in Orthopaedic Research

Bootstrap and multiple imputation under missing data in AR(1) models

The SIMNORMAL Procedure (Chapter)

The NESTED Procedure (Chapter)

Simulating Multivariate Normal Data

SAS/STAT 14.2 User s Guide. The SURVEYIMPUTE Procedure

2014 Stat-Ease, Inc. All Rights Reserved.

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

Handling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC

Using SAS Macros to Extract P-values from PROC FREQ

JMP Clinical. Release Notes. Version 5.0

A SAS Macro to Generate Caterpillar Plots. Guochen Song, i3 Statprobe, Cary, NC

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

CHAPTER 1 INTRODUCTION

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

Handling missing data for indicators, Susanne Rässler 1

Processing Missing Values with Self-Organized Maps

PROC MEANS for Disaggregating Statistics in SAS : One Input Data Set and One Output Data Set with Everything You Need

A SAS MACRO for estimating bootstrapped confidence intervals in dyadic regression

A Strip Plot Gets Jittered into a Beeswarm

3 Nonlinear Regression

Comparison of Hot Deck and Multiple Imputation Methods Using Simulations for HCSDB Data

Small Sample Equating: Best Practices using a SAS Macro

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

Missing data a data value that should have been recorded, but for some reason, was not. Simon Day: Dictionary for clinical trials, Wiley, 1999.

Latent Curve Models. A Structural Equation Perspective WILEY- INTERSCIENΠKENNETH A. BOLLEN

Using PROC PLAN for Randomization Assignments

THE L.L. THURSTONE PSYCHOMETRIC LABORATORY UNIVERSITY OF NORTH CAROLINA. Forrest W. Young & Carla M. Bann

Centering and Interactions: The Training Data

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background

Robust Imputation of Missing Values in Compositional Data Using the -Package robcompositions

The Basics of PROC FCMP. Dachao Liu Northwestern Universtiy Chicago

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

CREATING THE DISTRIBUTION ANALYSIS

Missing Data and Imputation

Transcription:

Paper CC-016 A macro for nearest neighbor Lung-Chang Chien, University of North Carolina at Chapel Hill, Chapel Hill, NC Mark Weaver, Family Health International, Research Triangle Park, NC ABSTRACT SAS software only has one data procedure named PROC MI for multiple to provide an easier way to impute missing data. However, it requires sufficient background to understand how to use it properly. In addition, multiple needs strong assumption of data profile, but sometimes practical data cannot match it. This paper describes an alternative data method named nearest neighbor, and also provides an easy SAS macro along with a simulated data example and a case study. INTRODUCTION Missing data problem is a very common situation in practical studies. Most SAS procedures require the assumption that the missing data mechanism is missing completely at random (MCAR) or missing at random (MAR), which allows one to ignore those records with missing data (i.e., casewise deletion ). However, this approach can be somewhat wasteful because we also eliminate other non-missing variables for any subject who only has missingness in a single variable. This is clearly inefficient if those non-missing variables are included in model fitting or statistical analysis. Moreover, some methods require strong assumptions regarding the data distribution, biases result if those assumptions are violated. Therefore, one could consider applying a data method with less restriction to make the whole data set more complete and use all available information for non-missing variables. The nearest neighbor is a kind of hot deck with a long history, and has been used in many surveys conducted by Statistics in Canada, the U.S. Bureau of Labor Statistics, and the U.S. Census Bureau. However, its statistical properties had not been derived until Chen and Shao (2001). As described by Little and Rubin (1987), all methods also require the missing data to be at least MAR, if not MCAR; the same is clearly true for the nearest neighbor approaches described here. This paper provides a brief introduction to nearest neighbor, provides an easy-to-use SAS macro fro the method, and demonstrates its use in two examples using simulated data and case study, respectively. The introduction explains the purpose and scope of your paper and provides readers with any general information they need to understand your paper. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by A missing value, j=n-m+1,,n, is imputed by choosing that value, i=1,,n-m, which is corresponding to its closest value to. This is also the true meaning of nearest neighbor. The definition of closest is determined by the distance between any two response values. In other words, the distance of the nearest neighborhood is calculated with all observed values for Y by When we find the response value, k=1,,n-m, which is the closest one to, j=1,,n, then we can impute its corresponding to missing value 1

If there are more than one whose corresponding response values has the same minimum distance to among others, then the mean of those X-values is imputed. SIMULATION DATA In this paper, a simulated data set will be used to apply a macro program with nearest neighbor. In this simulated data set, there are two variables named X and Y, respectively. In the beginning, both variables are all complete, and the sample size is 200. X is generated from uniform distribution U(0,1), and Y is generated from the following polynomial equation: Then, a random sampling can be applied to determine missing data randomly. The following process can generate an incomplete data set with missing values in X which mechanism follows MCAR. PROC SURVEYSELECT DATA=SIMDATA OUT=SIMDATA_SRS METHOD=SRS SAMPSIZE=190 SEED=1234; DATA SIMDATA_SRS; SET SIMDATA_SRS; NOMISS=1; DATA SIMDATA_MISS; MERGE SIMDATA SIMDATA_SRS; BY ID; IF NOMISS=. THEN X=.; DROP NOMISS SELECTIONPROB SAMPLINGWEIGHT; The missing rate in this simulated data set is defined as 5%, so 5 X-values are missing by the above program. MACRO A macro named %NNI will be introduced in this section. The call for the macro %NNI is as follows: %NNI(INDATA=, MISSVAR=, RESPVAR=, IDVAR=, OUTDATA=); There are five arguments in this macro. 1. INDATA: the name of input data set, including library name. 2. MISSVAR: scalar variable that contains some missing values to be imputed. 3. RESPVAR: response variable. If there is no specific response variable, any variable which is other than the variable in MISSVAR is allowable. If there are more than two choices, the variable with the highest correlation coefficient with the variable in MISSVAR is the best choice. 4. IDVAR: variable name of subject ID. It is recommended to include this variable in data set if there is no ID-like variable. 5. OUTDATA: name final data set with imputed data. The following code is the application of using %NNI in previous simulation data. %NNI(INDATA=SIM.SIMDATA_MISS, MISSVAR=X, RESPVAR=Y, IDVAR=ID, OUTDATA=SIM.SIMDATA_NNI); The scatter plot in figure 1 shows that imputed data tends to be very close to the unobserved missing value based on observed X and Y. The demographics of X also show the profile is similar to X s true descriptive statistics after using nearest neighbor. 2

Summary statistics for X True value Before After Mean 0.5181 0.5155 0.5172 STD 0.2942 0.2961 0.2941 Min 0.0009 0.0009 0.0009 Q1 0.2721 0.2622 0.2721 Median 0.5537 0.5537 0.5537 Q3 0.7686 0.7637 0.7646 Max 0.9967 0.9967 0.9967 Figure 1. The scatter plot of Y with observed, missing and imputed values of X CASE STUDY A case study about daily temperature and centralized particulate matter < 2.5 m in aerodynamic diameter (PM2.5) in Raleigh is applied here. The source of the data set is the National Morbidity Mortality Air Pollution Study (NMMAPS) which is maintained by the Department of Biostatistics in Johns Hopkins University. The period of data is from 1987 to 2000, but only 2000 s data are used here. Each variable has 366 observations. Daily temperature data are complete, but PM2.5 has 17 missing values. The missing rate is 4.64%. The following codes are used to impute missing data: %NNI(INDATA=CASE.RALFORNNI, MISSVAR=PM25TMEAN, RESPVAR=TMPD, IDVAR=DATE, OUTDATA=CASE.RAL_NNI); All imputed data are generated by known information of daily temperature, and plotted by red dots in figure 2. The demographics for PM2.5 before- and after- are very similar. The correlation between PM2.5 and daily temperature before is 0.3113, and it s 0.3136 after. Suppose this data set is used to be fit by generalized additive model (GAM), we can show how influence of imputing data by the nearest neighbor in model fitting. Summary statistics for PM2.5 Before After Mean 0.2297 0.2148 STD 7.6452 7.5347 Min -12.1366-12.1366 Q1-5.3750-5.2193 Median -0.6531-0.6257 Q3 4.7307 4.6272 Max 36.8448 36.8448 Figure 2. The scatter plot of daily temperature with observed and imputed centralized PM2.5. 3

Parameter Before Regression Model Analysis Parameter Estimates Parameter Estimate Standard Error t Value Pr > t Intercept 58.27474 0.76187 76.49 <.0001 pm25tmean 0.33609 0.04986 6.74 <.0001 Linear(t) 0.00602 0.00362 1.66 0.0974 Source Smoothing Model Analysis Analysis of Deviance DF Sum of Squares Chi-Square Pr > ChiSq Spline(t) 6.0000 66097 1321.3932 <.0001 Parameter After Regression Model Analysis Parameter Estimates Parameter Estimate Standard Error t Value Pr > t Intercept 57.70478 0.73717 78.28 <.0001 pm25tmean 0.33837 0.04900 6.91 <.0001 Linear(t) 0.00821 0.00349 2.35 0.0192 Source Smoothing Model Analysis Analysis of Deviance DF Sum of Squares Chi-Square Pr > ChiSq Spline(t) 6.0000 70078 1422.1678 <.0001 The model is defined as: Table 1. Outputs of PROC GAM before and after The results of model fitting are shown in table 1. The estimated coefficient of PM2.5 after is a little bit higher than that before ; however, its standard error is reduced with imputed data. The estimated smoothing function of time with imputed PM2.5 data is also similar as that without imputed PM2.5 data (figure 3). LIMITATION In original methodology of nearest neighbor, there is no special restriction. Both continuous and categorical variable can apply it. However, this macro is not suitable for imputed missing categorical data because if there are more than one closest observed response for a missing datum, it is likely to generate imputed values with decimals. A modified approach is using an additional data step to round imputed data into integers. In addition, even though there is no study to support how large data set it can support, but too small sample size may cause problems. A trial reminder is that there should be at least one variable that is totally complete without any missing value. Even though this complete variable is not a response variable in analysis, you can assume it is. If there are more than one complete variable, you can easily use PROC CORR procedure to make a correlation matrix among them, and pick the complete variables with the highest correlation with missing variable. Figure 3. Estimated smoothing functions with 95% confidence interval before and after using. 4

REFERENCES Chen, J., and Shao, J. (2000). Nearest neighbor for survey data, Journal of Official Statistics, Vol.16, No.2, 113-131. Little, R. J. A., and Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley & Sons. Nittner, T. (2002). The additive model with missing values in the independent variable theory and simulation, Collaborative Research Center 386, Discussion Paper 272. Van Hulse, J., and Khoshgoftaar T.M. (2007). Incomplete-case nearest neighbor in software measurement data, Proceedings of the IEEE International Conference on Information Reused Integration (IRI 2007), 630-637. SAS Institute Inc., SAS Macro Language: Reference, Version 8, Cary, NC: SAS Institute, Inc., 1999. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Lung-Chang Chien University of North Carolina at Chapel Hill Chapel Hill, NC 27599 E-mail: cchien@email.unc.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. APPENDIX %MACRO NNI(INDATA=, /*INPUT DATA SET(INCLUDING LIBRARY NAME) */ MISSVAR=, /*INPUT SINGLE VARIABLE WITH MISSING DATA*/ RESPVAR=, /*RESPONSE VARIABLE*/ IDVAR=, /*SUBJECT VARIABLE*/ OUTDATA= /*OUTPUT DATA SET WITH COMPLETE DATA*/ ); DATA OBSY_MISSX; SET &INDATA; IF &MISSVAR=.; KEEP &IDVAR &RESPVAR; PROC MEANS DATA=&INDATA N NMISS NOPRINT; VAR &MISSVAR; OUTPUT OUT=OBSN N=OBSX NMISS=MISSX; DATA _NULL_; SET OBSN; CALL SYMPUT('OBSN',OBSX); CALL SYMPUT('MISSN',MISSX); DATA SIMDATA_NOMISS; SET &INDATA; IF &MISSVAR NE.; PROC IML; USE SIMDATA_NOMISS; READ ALL VAR {&MISSVAR} INTO XMAT; READ ALL VAR {&RESPVAR} INTO YMAT; 5

USE OBSY_MISSX; READ ALL VAR {&IDVAR &RESPVAR} INTO MISSMAT; TXMAT=T(XMAT); TYMAT=T(YMAT); DISTANCE=J(&MISSN,&OBSN,.); MIND=J(&MISSN,1,.); MISSVAR=J(&MISSN,&OBSN,.); IMP=J(&MISSN,1,.); DO I = 1 TO &MISSN; DO J = 1 TO &OBSN; DISTANCE[I,J]=ABS(MISSMAT[I,2]-TYMAT[,J]); MIND[I,]=MIN(DISTANCE[I,]); DO I = 1 TO &MISSN; DO J = 1 TO &OBSN; IF DISTANCE[I,J]=MIND[I,] THEN MISSVAR[I,J]=TXMAT[,J]; IMP[I,]=MISSVAR[I,:]; CNAME={"&IDVAR" "&RESPVAR" "&MISSVAR"}; IMPX=MISSMAT IMP; CREATE NNI FROM IMPX[C=CNAME]; APPEND FROM IMPX; QUIT; DATA &OUTDATA; MERGE &INDATA NNI; BY &IDVAR; %M 6