Smoking and Missingness: Computer Syntax 1

Similar documents
Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

MISSING DATA AND MULTIPLE IMPUTATION

Correctly Compute Complex Samples Statistics

- 1 - Fig. A5.1 Missing value analysis dialog box

3.6 Sample code: yrbs_data <- read.spss("yrbs07.sav",to.data.frame=true)

11.0 APPENDIX-B: COMPUTATION

Calculating measures of biological interaction

Package binomlogit. February 19, 2015

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

STA 570 Spring Lecture 5 Tuesday, Feb 1

AMELIA II: A Program for Missing Data

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Motivating Example. Missing Data Theory. An Introduction to Multiple Imputation and its Application. Background

The SAS interface is shown in the following screen shot:

Multiple-imputation analysis using Stata s mi command

An introduction to SPSS

The brlr Package. March 22, brlr... 1 lizards Index 5. Bias-reduced Logistic Regression

Introduction to Mplus

Introduction to Statistical Analyses in SAS

15 Wyner Statistics Fall 2013

Contents of SAS Programming Techniques

Missing Data: What Are You Missing?

Package GWAF. March 12, 2015

Getting started with simulating data in R: some helpful functions and how to use them Ariel Muldoon August 28, 2018

Data Management - 50%

Stat 5100 Handout #14.a SAS: Logistic Regression

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

Lab #9: ANOVA and TUKEY tests

A. Using the data provided above, calculate the sampling variance and standard error for S for each week s data.

Correctly Compute Complex Samples Statistics

Activity Overview A basic introduction to the many features of the calculator function of the TI-Nspire

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1

Zero-Inflated Poisson Regression

Data analysis using Microsoft Excel

A. Incorrect! This would be the negative of the range. B. Correct! The range is the maximum data value minus the minimum data value.

Fathom Dynamic Data TM Version 2 Specifications

JMP Book Descriptions

SAS/STAT 13.1 User s Guide. The NESTED Procedure

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT

A User Manual for the Multivariate MLE Tool. Before running the main multivariate program saved in the SAS file Part2-Main.sas,

SAS/STAT 14.2 User s Guide. The SURVEYIMPUTE Procedure

Generalized least squares (GLS) estimates of the level-2 coefficients,

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

BACKGROUND INFORMATION ON COMPLEX SAMPLE SURVEYS

Multiple imputation using chained equations: Issues and guidance for practice

rasch: A SAS macro for fitting polytomous Rasch models

CHAPTER 11 EXAMPLES: MISSING DATA MODELING AND BAYESIAN ANALYSIS

STAT 7000: Experimental Statistics I

Acknowledgments. Acronyms

CSE 123 Introduction to Computing

Statements with the Same Function in Multiple Procedures

Ben Baumer Instructor

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure

S2 Text. Instructions to replicate classification results.

CORExpress 1.0 USER S GUIDE Jay Magidson

Random Sampling For the Non-statistician Diane E. Brown AdminaStar Solutions, Associated Insurance Companies Inc.

Package mcemglm. November 29, 2015

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

CLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi

The DMINE Procedure. The DMINE Procedure

CHAPTER 1 INTRODUCTION

SLStats.notebook. January 12, Statistics:

Package samplesizelogisticcasecontrol

Handbook of Statistical Modeling for the Social and Behavioral Sciences

STATA 13 INTRODUCTION

The SAS %BLINPLUS Macro

SOFTWARE TOOLS FOR OPTIMAL TWO-STAGE SAMPLING A USER GUIDE

Generalized Linear Models in Lisp-Stat. Luke Tierney. September 1994

[/TTEST [PERCENT={5}] [{T }] [{DF } [{PROB }] [{COUNTS }] [{MEANS }]] {n} {NOT} {NODF} {NOPROB}] {NOCOUNTS} {NOMEANS}

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

The NESTED Procedure (Chapter)

Module 25.1: nag lin reg Regression Analysis. Contents

Package InformativeCensoring

Structured Data, LLC RiskAMP User Guide. User Guide for the RiskAMP Monte Carlo Add-in

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Missing Data in Orthopaedic Research

Using Mplus Monte Carlo Simulations In Practice: A Note On Non-Normal Missing Data In Latent Variable Models

Mastery. PRECALCULUS Student Learning Targets

Supplementary Notes on Multiple Imputation. Stephen du Toit and Gerhard Mels Scientific Software International

Ivy s Business Analytics Foundation Certification Details (Module I + II+ III + IV + V)

An Introduction to Model-Fitting with the R package glmm

SAS Graphics Macros for Latent Class Analysis Users Guide

GRAPHING CALCULATOR REFERENCE BOOK

Estimation of Item Response Models

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

September 11, Unit 2 Day 1 Notes Measures of Central Tendency.notebook

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

Excel R Tips. is used for multiplication. + is used for addition. is used for subtraction. / is used for division

/********************************************/ /* Evaluating the PS distribution!!! */ /********************************************/

PROC JACK REG - A SAS~ PROCEDURE FOR JACKKNIFE REGRESSION Panayiotis Hambi, West Virginia University

Handling missing data for indicators, Susanne Rässler 1

Section 2.3 (e-book 4.1 & 4.2) Rational Functions

Centering and Interactions: The Training Data

WarmFusion Reference Manual

Approximation methods and quadrature points in PROC NLMIXED: A simulation study using structured latent curve models

Transcription:

Smoking and Missingness: Computer Syntax 1 Computer Syntax SAS code is provided for the logistic regression imputation described in this article. This code is listed in parts, with description provided for each part, however, in a given analysis these parts would be combined into a single syntax file. This syntax file can be obtained at www.uic.edu/ hedeker/long.html. To start, the code below reads in the dataset used in this article. DATA one; INFILE c:\smoke.dat ; INPUT id smk miss smk0 grp; Here, id is the subject identifier, smk is the smoking status at the final timepoint (0=abstinent, 1=smoking,.=missing), miss is the missing indicator (0=observed or 1=missing at the final timepoint), smk0 is the t0 smoking status (0=abstinent, 1=smoking), and grp is the grouping variable (0=control, 1=treatment). Uppercase letters are used to designate SAS syntax, and lowercase letters are used to designate user-defined entities. Multiple Imputation Several variables need to be defined for the subsequent logistic regression multiple imputation model. First, we designate the cell frequencies for the observed cells in the crosstab of smk0 by miss by smk (i.e., the cell indicators forming the suffix of the n variables reflect the values of these three binary variables in this particular order). Note that these frequencies could be obtained in the present analysis, but, for simplicity, here they have been gotten from a previous analysis of these data and simply typed in. Comments are provided on each line using the /*... */ convention of SAS. n111 = 42; /* number of abstainers - smk0=abstinent */ n112 = 71; /* number of smokers - smk0=abstinent */ n211 = 36; /* number of abstainers - smk0=smoking */ n212 = 223; /* number of smokers - smk0=smoking */ Now, the mean values of the coefficients for the logistic regression are obtained based

Smoking and Missingness: Computer Syntax 2 on the above frequencies and the assumed level of the odds ratio for missing and smoking. orat = 2; beta0m = LOG(n112/n111); beta1m = LOG(orat); beta2m = LOG(n212/n211); beta3m = LOG(orat); seed = 974677743; The regression coefficients beta0m and beta2m are based on the smoking rates for t0 nonsmokers and t0 smokers, respectively. Similarly, the coefficients beta1m and beta3m are set depending on the assumed odds ratio of smoking and missing, for t0 non-smokers and smokers, respectively. Here, an odds ratio of 2 is assumed for both. The SAS function LOG computes the natural logarithm of the argument. The variable seed, which is an integer value that starts the subsequent random number functions, is also set. The cell frequencies for the missing cells are now based on the observed cell frequencies, the numbers of missing subjects (these numbers are simply typed in), and the assumed odds ratio. Again, these are for the crosstab of smk0 by miss by smk. n12dot = 37; /* number of missing for smk0=abstinent */ p122 = (orat*(n112/n111))/(1 + (orat*(n112/n111))); n122 = p122*n12dot; n121 = (1 - p122)*n12dot; n22dot = 80; /* number of missing for smk0=smoking */ p222 = (orat*(n212/n211))/(1 + (orat*(n212/n211))); n222 = p222*n22dot; n221 = (1 - p222)*n22dot; Based on the cell sample sizes, the variances associated with the regression coefficients are calculated (these formulas can be found in Agresti [2002]). These will be used in the imputation below in order to add sampling variation into the process (i.e., because

Smoking and Missingness: Computer Syntax 3 the regression coefficients above are estimates of the population parameters, sampling variation needs to be added to these regression coefficients in the imputation). beta0v = (n111 + n112)/(n111*n112); beta1v = 1/n111 + 1/n112 + 1/n121 + 1/n122; beta2v = (n211 + n212)/(n211*n212); beta3v = 1/n211 + 1/n212 + 1/n221 + 1/n222; Imputation is now done for the logistic regression model. As described in the article, it is important to perform this imputation multiple times. The code below can be used for this purpose, in this case it is done 100 times. To get a random draw from a standard logistic distribution, we use the fact that this distribution can be obtained as the natural logarithm of the ratio of two independent standard exponential distributions (see McCullagh and Nelder [1989], page 20). Random draws from standard exponential distributions are obtained using the SAS function RANEXP. Random draws from standard normal distributions are obtained using the SAS function RANNOR. As described in the article, the Cholesky factorization, or matrix square root, of the variance-covariance matrix associated with the regression coefficients is used, and applied to the standard random normal deviates that are obtained using RANNOR. For this, the SAS function SQRT is used to give the square root of the indicated argument. DATA sim; SET one; ARRAY smks(100) smks1-smks100; DO i = 1 TO 100; IF miss EQ 0 THEN smks(i) = smk; IF miss EQ 1 THEN DO; exp1 = RANEXP(seed); exp2 = RANEXP(seed); stdl = LOG(exp1/exp2); ran0 = RANNOR(seed); ran1 = RANNOR(seed); /* the next lines incorporate the covariance between beta0 and beta1 (likewise for beta2 and beta3) using the Cholesky factorization */ beta0 = beta0m + ran0*sqrt(beta0v);

Smoking and Missingness: Computer Syntax 4 beta1 = beta1m - ran0*sqrt(beta0v) + ran1*sqrt(beta1v - beta0v); ran2 = RANNOR(seed); ran3 = RANNOR(seed); beta2 = beta2m + ran2*sqrt(beta2v); beta3 = beta3m - ran2*sqrt(beta2v) + ran3*sqrt(beta3v - beta2v); ystar = (1 - smk0)*(beta0 + beta1*miss) + smk0*(beta2 + beta3*miss) + stdl; smks(i) = 0; IF ystar > 0 THEN smks(i) = 1; END; END; Here, a new dataset sim is created which will contain 100 smoking variables named smks1 to smks100. ADO loop is used to create these 100 variables, and the ARRAY statement is used to specify a vector named smks containing the 100 smoking repetitions. These are set to the original variable smk for observed individuals, and imputed otherwise. Analysis of Multiply-Imputed Data To analyze the multiply-imputed data, we first have to adjust the data so that they are in the long format. Namely, in the file sim, which is in the wide format, each of the 100 smoking variables are associated with one case, whereas, for the analyses to be performed, each needs to be a separate case, with a variable indicating the imputation number. The code below does this translation, yielding a variable smki that is the smoking variable, and the variable imputation that is the imputation number. These variables, and grp, are saved in the dataset unisim. DATA unisim (KEEP = id grp smki ARRAY smks(100) smks1-smks100; DO imputation = 1 TO 100; smki = smks( imputation ); OUTPUT; END; imputation ); SET sim;

Smoking and Missingness: Computer Syntax 5 The data are now sorted by imputation. PROC SORT; BY imputation ; The logistic regression analysis is performed, stratified by imputation (i.e., performed 100 times), and the results from each analysis are saved in the dataset outlreg. PROC LOGISTIC DESCENDING NOPRINT OUTEST=outlreg COVOUT; MODEL smki = grp / LINK = LOGIT; BY imputation ; The results corresponding to the regression coefficients (i.e., for intercept and grp) from the 100 logistic regression analyses are combined using PROC MIANALYZE. PROC MIANALYZE DATA=outlreg; VAR intercept grp; PROC MIANALYZE prints out the results of the multiple imputation process for the two logistic regression parameters intercept and grp. The test statistic and p-value of the latter are reported in Table 2 of the article.

Smoking and Missingness: Computer Syntax 6 References A. Agresti. Categorical Data Analysis, 2nd edition. Wiley, New York, 2002. P. McCullagh and J. A. Nelder. Generalized linear models (2nd edition). Chapman and Hall, New York, 1989.