SAS Macros CORR_P and TANGO: Interval Estimation for the Difference Between Correlated Proportions in Dependent Samples

Similar documents
Paper PO-06. Gone are the days when social and behavioral science researchers should simply report obtained test statistics (e.g.

Biostat Methods STAT 5820/6910 Handout #4: Chi-square, Fisher s, and McNemar s Tests

High-Performance Procedures in SAS 9.4: Comparing Performance of HP and Legacy Procedures

Using SAS Macros to Extract P-values from PROC FREQ

Applied Multivariate Analysis

Box-Cox Transformation

Descriptive Statistics, Standard Deviation and Standard Error

Lab #3: Probability, Simulations, Distributions:

GEN_OMEGA2: The HPSUMMARY Procedure: A SAS Macro for Computing the Generalized Omega-Squared Effect Size Associated with

Strategies for Modeling Two Categorical Variables with Multiple Category Choices

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

A STATISTICAL INVESTIGATION INTO NONINFERIORITY TESTING FOR TWO BINOMIAL PROPORTIONS NICHOLAS BLOEDOW. B.S. University of Wisconsin-Oshkosh, 2011

DSCI 325: Handout 10 Summarizing Numerical and Categorical Data in SAS Spring 2017

Frequencies, Unequal Variance Weights, and Sampling Weights: Similarities and Differences in SAS

SAS/STAT 13.1 User s Guide. The Power and Sample Size Application

Choosing the Right Procedure

Pair-Wise Multiple Comparisons (Simulation)

An Algorithm to Compute Exact Power of an Unordered RxC Contingency Table

STATS PAD USER MANUAL

Using PROC REPORT to Cross-Tabulate Multiple Response Items Patrick Thornton, SRI International, Menlo Park, CA

Information Criteria Methods in SAS for Multiple Linear Regression Models

Examining the Impact of Drifted Polytomous Anchor Items on Test Characteristic Curve (TCC) Linking and IRT True Score Equating

Fathom Dynamic Data TM Version 2 Specifications

Macros for Two-Sample Hypothesis Tests Jinson J. Erinjeri, D.K. Shifflet and Associates Ltd., McLean, VA

Box-Cox Transformation for Simple Linear Regression

Statistical Programming in SAS. From Chapter 10 - Programming with matrices and vectors - IML

SAS/STAT 13.1 User s Guide. The NESTED Procedure

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

A noninformative Bayesian approach to small area estimation

MIXED_RELIABILITY: A SAS Macro for Estimating Lambda and Assessing the Trustworthiness of Random Effects in Multilevel Models

Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic

ABC Macro and Performance Chart with Benchmarks Annotation

The Power and Sample Size Application

Frequentist and Bayesian Interim Analysis in Clinical Trials: Group Sequential Testing and Posterior Predictive Probability Monitoring Using SAS

SAS/STAT 13.1 User s Guide. The SURVEYFREQ Procedure

(X 1:n η) 1 θ e 1. i=1. Using the traditional MLE derivation technique, the penalized MLEs for η and θ are: = n. (X i η) = 0. i=1 = 1.

Multiple Comparisons of Treatments vs. a Control (Simulation)

A New Statistical Procedure for Validation of Simulation and Stochastic Models

Slides 11: Verification and Validation Models

Instructions for Using ABCalc James Alan Fox Northeastern University Updated: August 2009

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

A Macro Application on Confidence Intervals for Binominal Proportion

Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?

Optimizing Pharmaceutical Production Processes Using Quality by Design Methods

StatsMate. User Guide

Data Dictionary. National Center for O*NET Development

186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95

The NESTED Procedure (Chapter)

PROC MEANS for Disaggregating Statistics in SAS : One Input Data Set and One Output Data Set with Everything You Need

PCTI Geometry. Summer Packet

Package binmto. February 19, 2015

This code and the crash data set can be found on the course web page.

BIOS: 4120 Lab 11 Answers April 3-4, 2018

StatCalc User Manual. Version 9 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved.

Table of Contents (As covered from textbook)

Effective probabilistic stopping rules for randomized metaheuristics: GRASP implementations

Windows Application Using.NET and SAS to Produce Custom Rater Reliability Reports Sailesh Vezzu, Educational Testing Service, Princeton, NJ

Categorical Data in a Designed Experiment Part 2: Sizing with a Binary Response

Product Catalog. AcaStat. Software

Creating an ADaM Data Set for Correlation Analyses

proc: display and analyze ROC curves

Zero-Inflated Poisson Regression

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

8. MINITAB COMMANDS WEEK-BY-WEEK

1. In this problem we use Monte Carlo integration to approximate the area of a circle of radius, R = 1.

Telephone Survey Response: Effects of Cell Phones in Landline Households

STATA Note 5. One sample binomial data Confidence interval for proportion Unpaired binomial data: 2 x 2 tables Paired binomial data

Detecting Polytomous Items That Have Drifted: Using Global Versus Step Difficulty 1,2. Xi Wang and Ronald K. Hambleton

NCSS Statistical Software

EVALUATION OF THE NORMAL APPROXIMATION FOR THE PAIRED TWO SAMPLE PROBLEM WITH MISSING DATA. Shang-Lin Yang. B.S., National Taiwan University, 1996

Subset Selection in Multiple Regression

A Side of Hash for You To Dig Into

Multivariate Normal Random Numbers

Correctly Compute Complex Samples Statistics

SigmaXL Feature List Summary, What s New in Versions 6.0, 6.1 & 6.2, Installation Notes, System Requirements and Getting Help

Basic Concepts #6: Introduction to Report Writing

Hypothesis Testing: An SQL Analogy

CMISS the SAS Function You May Have Been MISSING Mira Shapiro, Analytic Designers LLC, Bethesda, MD

Sending SAS Data Sets and Output to Microsoft Excel

NCSS Statistical Software. Robust Regression

Choosing the Right Procedure

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

Chapter 17: INTERNATIONAL DATA PRODUCTS

Paper CC-016. METHODOLOGY Suppose the data structure with m missing values for the row indices i=n-m+1,,n can be re-expressed by

Math 227 EXCEL / MEGASTAT Guide

One Project, Two Teams: The Unblind Leading the Blind

Creating Forest Plots Using SAS/GRAPH and the Annotate Facility

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure

Calculation of the Mean and Variance of Lognormal Data Which Contains Left-Censored Observations

Ranking Between the Lines

Uncommon Techniques for Common Variables

STATISTICS (STAT) Statistics (STAT) 1

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown

Paper SDA-11. Logistic regression will be used for estimation of net error for the 2010 Census as outlined in Griffin (2005).

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression

Today s class. Roots of equation Finish up incremental search Open methods. Numerical Methods, Fall 2011 Lecture 5. Prof. Jinbo Bi CSE, UConn

Something for Nothing! Converting Plots from SAS/GRAPH to ODS Graphics

Small Sample Equating: Best Practices using a SAS Macro

Transcription:

Paper SD-03 SAS Macros CORR_P and TANGO: Interval Estimation for the Difference Between Correlated Proportions in Dependent Samples Patricia Rodríguez de Gil, Jeanine Romano Thanh Pham, Diep Nguyen, Jeffrey D. Kromrey, Eun Sook Kim University of South Florida, Tampa, FL ABSTRACT Two proportions from the same sample of observations or from matched-pair samples are correlated. A number of studies proposed interval estimation for the difference in correlated proportions (e.g., Bonett & Price, 2011; Newcombe, 1998; Tango, 1998). Considering that confidence intervals (CI) are more informative than point estimates but the CI for the difference in correlated proportions is not readily available in SAS, the purpose of this paper is to provide a SAS macro for three types of confidence intervals suggested in the literature: Wald CI, adjusted Wald CI, and approximate CI proposed by Tango. The results from a simulation study comparing these three confidence intervals are also presented. INTRODUCTION The difference in correlated or dependent proportions is often of interest in studies such as pretest-posttest designs, matched-pair designs, and rater-agreement designs (Bonett & Price, 2011). Currently, McNemar s test (1947) is commonly conducted to test the equivalence of two correlated proportions. McNemar s test is also implemented in SAS (the AGREE option in PROC FREQ). Two dichotomous outcomes measured from dependent samples are summarized in a 2x2 contingency table. For example, the results of two diagnostic tests of dyslexia (Tests A and B) administered to a sample of pre-school children are summarized with the cell frequency of the cases in which diagnosis is correct in both tests (correct-correct), correct in Test A but incorrect in Test B (correct-incorrect), incorrect in Test A but correct in Test B (incorrect-correct), and incorrect in both tests (incorrect-incorrect) as illustrated in Table 1. Test A Test B Correct Incorrect Total Correct a (π 11) b (π 12) a + b (π 1+) Incorrect c (π 21) d (π 22) c + d (π 2+) Total a + c (π +1) b + d (π +2) n (1) Table 1. Contingency Table of Dyslexia Diagnosis from Tests A and B McNEMAR TEST Q STATISTIC In this example, the null hypothesis is that the population proportion of correct diagnosis on dyslexia using Test A (π 1+ = [a + b]/n) equals the population proportion of correct diagnosis using Test B (π +1 = [a + c]/n). Testing the null hypothesis π 1+ - π +1 = 0 is equivalent to testing π 12 - π 21 = 0 because π 11 is common to both π 1+ and π +1. Given the null hypothesis McNemar s test Q statistic is computed as where b is the cell frequency of cases in which diagnosis is correct in test A but incorrect in test B, and c is the cell frequency of cases in which diagnosis is incorrect in test A but correct in test B Under the null hypothesis, the Q statistic follows an asymptotic chi-square distribution with one degree of freedom when b + c is greater than 10 (McNemar, 1947). WALD CONFIDENCE INTERVAL A 100(1 α)% Wald confidence interval for the difference in the population proportions (π 12 - π 21) can be estimated as 1

where [ ] (1) is a critical value at α/2 from the standard normal distribution However, when Newcombe (1998) examined the performance of existing methods to compute the confidence intervals for the difference in correlated proportions, the Wald confidence interval (Equation 1) showed inadequate performance. ADJUSTED WALD CONFIDENCE INTERVAL Recently, Bonett and Price (2011) proposed an alternative CI by making an adjustment to the Wald interval shown in Equation 1, [ ] (2) where each cell proportion is computed by adding one to the cell frequency and two to the total n, for example, and. Bonett and Price reported that the adjusted Wald interval performs as well as an approximate CI proposed by Tango (1998) but the computation is simpler than Tango s. TANGO CONFIDENCE INTERVAL On the other hand, the confidence interval for the difference in two correlated proportions (λ = π 1+ - π +1 = π 12 - π 21) developed by Tango is estimated by solving the following two equations iteratively until the change in estimation is infinitesimal below the predetermined cutoff. (3) where λ is the difference between two correlated proportions, and in Equation 3 is estimated as (4) where A = 2n, B = -b c + (2n b + c)λ, and C = -cλ(1 λ). Although the computational procedures for Tango s CI are more complex than Wald and adjusted Wald intervals, the upper and lower limits are easily found through the secant method with empirically good coverage probabilities (Tango, 1999) and can be applied to small samples with off-diagonal zero cells (Tango, 1998). SAS MACROS Two SAS macros are used to estimate the endpoints of the confidence intervals. The macro TANGO estimates endpoints of the Tango (1998) interval. The macro CORR_P estimates the two Wald estimates and calls the macro TANGO. Both macros are written in BASE SAS. A SAS macro for confidence intervals suggested by Tango (1998) is presented below. Equations 3 and 4 are solved iteratively by evaluating the change in the estimated λ (i.e., the difference in two correlated proportions). %macro Tango (a, b, c, d, n, X0, X1,Z); iteration = 1; * Initializing iteration counter; a = &a;b = &b;c = &c;d = &d;n = &n;x0 = &X0;x1 = &X1;Z = &Z; do until (abs(change) <.000001); 2

* Evaluate x0; lambda = x0; AA = 2*n; BB = -1*b-c+(2*n - b + c)*lambda; CC = -1*c*lambda*(1-lambda); q21_hat = (SQRT(BB**2-4*AA*CC) - BB)/(2*AA); if (n*(2*q21_hat+lambda*(1-lambda))>=0) then fx0 = b-c-n*lambda - Z*SQRT(n*(2*q21_hat+lambda*(1-lambda))); if (n*(2*q21_hat+lambda*(1-lambda))<0) then fx0 = b-c-n*lambda; *+---------------------------------------------------+ * Evaluate x1; +---------------------------------------------------+; lambda = x1; AA = 2*n; BB = -1*b-c+(2*n - b + c)*lambda; CC = -1*c*lambda*(1-lambda); q21_hat = (SQRT(BB**2-4*AA*CC) - BB)/(2*AA); if (n*(2*q21_hat+lambda*(1-lambda))>=0) then fx1 = b-c-n*lambda - Z*SQRT(n*(2*q21_hat+lambda*(1-lambda))); if (n*(2*q21_hat+lambda*(1-lambda))<0) then fx1 = b-c-n*lambda; *+---------------------------------------------------+ * Evaluate change in lambda; +----------------------------------------------------+ x2 = x1 - (fx1 * (x1-x0)/(fx1 - fx0)); * update value of lambda to evaluate; change = x2 - x1; *+-------------------------------------------------------------------------+ * Set x0 and x1 for next iteration of loop, and increment iteration counter; +-------------------------------------------------------------------------+; x0 = x1; x1 = x2; iteration = iteration + 1; end; %mend Tango; The first part of the macro CORR_P runs PROC FREQ with the raw data given below as an input and yields the cell frequencies in the contingency table (i.e., a through d in Table 1) as an output. Next, the output data are restructured to be used in the computation of three types of confidence intervals. In the third part of the macro, the Wald CI (denoted by Wald_old_up and _down) and adjusted Wald CI (denoted by Wald_new_up and _down) are computed using Equations 1 and 2, and the macro for Tango CI is called. A FILE PRINT statement is used to print the obtained confidence interval limits of Tango CI, Wald CI, and adjusted Wald, respectively. %macro CORR_P (indata=, row_var =, col_var =, percent_ci = 95); * output with count and percent; * print suppressed; PROC FREQ /*noprint*/; TABLES &row_var*&col_var / norow nocol out=a; *+-------------------------------------------------------------------------+ * Restructure output data set to a single observation with four frequencies and four percents; +--------------------------------------------------------------------------+; DATA two; set a; row = _n_; array pct[4] pct1 - pct4; 3

array frq[4] freq1 - freq4; retain pct1 - pct4 freq1 - freq4; pct[row] = PERCENT; frq[row] = COUNT; if row = 4 then output; keep pct1 - pct4 freq1 - freq4; proc print data = two; DATA CI; set two; a = Freq4; b = Freq2; c = Freq3; d = Freq1; * These are four cell frequencies; *a = 18; *b = 4; *c = 12; *d = 5; *Proc print data = uplim; * n = a+b+c+d; lambda_hat = (b-c)/n; * Sample difference in proportions; CI_prop = &percent_ci/100; ZL = PROBIT(1 - (1 - CI_prop)/2); * Upper critical value of Z; ZU = PROBIT((1 - CI_prop)/2); * Lower critical value of Z; *+--------------------------------------------+ * Endpoints of orignal Wald interval; +---------------------------------------------+; Wald_old_up = lambda_hat + ZL*sqrt ((((c+b)/n)-lambda_hat**2)/n); Wald_old_low = lambda_hat + ZU*sqrt ((((c+b)/n)-lambda_hat**2)/n); *+---------------------------------------------+ * Endpoints of adjusted Wald interval; +----------------------------------------------+; newb = (b +1)/(n+2); newc=(c+1)/(n+2); Wald_new_up = newb-newc + ZL *sqrt((newb+newc-(newb-newc)**2)/(n+2)); Wald_new_low = newb-newc + ZU *sqrt((newb+newc-(newb-newc)**2)/(n+2)); x0 = -.9999; * starting values of lambda; x1 =.9999; %Tango (a, b, c, d, n, X0, X1,ZU); Tango_up = X1; x0 = -.9999; * starting values of lambda; x1 =.9999; %Tango (a, b, c, d, n, X0, X1,ZL); Tango_Low = X1; DATA_null_; set CI; 4

CI_Level = CI_prop*100; file print header = h notitles; put @1 'Original Wald' @20 wald_old_low 8.5 @30 wald_old_up 8.5 // @1 'Revised Wald' @20 wald_new_low 8.5 @30 wald_new_up 8.5 // @1 'Tango' @20 Tango_low 8.5 @30 Tango_up 8.5; return; h: put @1 'Confidence Intervals for Correlated Proportions' // @1 'Level of Confidence:' @31 CI_Level @33 '%' // @1 'Sample Proportion Difference:' @30 lambda_hat 8.5 /// @1 'CI Method' @20 ' Lower' @30 ' Upper' / @1 '-------------' @20 '--------' @30 '--------'; %mend CORR_P; MACRO EXECUTION As an example of the use of the macro CORR_P, the following data step creates a data set called study, containing 39 observations. Each observation corresponds to pass or fail in Biology and Algebra. The Macro CORR_P is called to illustrate the estimation of confidence intervals for correlated proportions for the SAS user s provided data. DATA study; INPUT bio $ alg $ ; cards; P F P F P F P F 5

; %CORR_P(indata=study, row_var =bio, col_var = alg, percent_ci = 95); OUTPUT EXAMPLE OF MACRO CORR_P The output table from the macro CORR_P is presented in Output 1. The simple table presents the sample difference between the two proportions, the level of confidence requested in the macro call, and the endpoints of the three confidence intervals. In this sample, the upper endpoints of the three intervals are nearly identical, but greater differences are evident in the lower endpoints. Confidence Intervals for Correlated Proportions Level of Confidence: 95% Sample Proportion Difference: 0.20513 CI Method Lower Upper ------------- -------- -------- Original Wald 0.01469 0.39556 Revised Wald 0.00130 0.38894 Tango 0.00443 0.39263 Output 1: Wald and Tango Confidence Intervals SIMULATION RESEARCH The accuracy and precision of the three confidence interval estimation methods were investigated using Monte Carlo methods. In this simulation study three design factors were manipulated: (a) sample size (n = 10, 50, 100, 500, and 1000), (b) magnitude and direction of population correlation between the two proportions (f = -.40 to +.40, in increments of.10), and (c) difference between population proportions (Δ = -.30, -.25, -.10, -.05,.00,.05,.10,.25, and.30). These factors in the Monte Carlo study were completely crossed, yielding 405 conditions. For each condition, 100,000 replications were conducted. The use of 100,000 estimates provides adequate precision for the investigation of confidence interval coverage and width. For example, 100,000 replications provide a maximum 95% confidence interval width around an observed proportion that is.0031 (Robey & Barcikowski, 1992). The data for the simulation were generated using uniform random numbers on the zero to one interval (the SAS RANUNI function). The values of the random numbers were used to assign observations to cells in the contingency table. In each sample, the three confidence interval methods were applied to provide estimates of 90%, 95% and 99% intervals. The distributions of 95% confidence interval coverage estimates across all conditions simulated are presented in Figure 1. These results show that the adjusted Wald method provides the best interval coverage overall among the three methods. The original Wald method provides lower coverage than the nominal confidence level in many conditions. Both the adjusted Wald and the Tango methods provide the higher coverage for most of the sample conditions, but the Tango method produces substantial under coverage in a few conditions. The distribution of interval coverage of the adjusted Wald method is closer to the expected confidence level than that of Tango method. Analogous results were found for the 90% and 99% confidence intervals. 6

95% CI Width 95% CI Coverage SESUG 2013 1.00 Distributions of Interval Coverage (95%) 0.95 0.90 0.85 0.80 0.75 0.70 Original Wald Adjusted Wald Tango Estimation Method Figure 1. Distributions of Interval Coverage at 95% Confidence Level Figure 2 presents the distributions of confidence interval widths across all conditions simulated. The results show that these distributions are nearly identical across all three estimation methods, indicating that the methods provide the same degree of precision. The distributions of interval widths for 90% and 99% confidence levels were also nearly identical. Distributions of Interval Widths (95%) 1.00 0.75 0.50 0.25 0 Original Wald Adjusted Wald Tango Estimation Method Figure 2. Distributions of Interval Widths at 95% Confidence Level 7

The influence of sample size on confidence interval coverage is presented in Figure 3. This figure shows that Tango method provides the most accurate coverage at the smallest sample size examined (n = 10) and provides a slight amount of over coverage with large sample sizes. The original Wald method produces substantial under coverage at the small sample sizes and gets closer to the expected confidence interval coverage when sample size increases. The adjusted Wald method provides slight over coverage at the smallest sample size but quickly converges to the nominal.95 coverage with larger samples. Coverage 0.97 0.96 0.95 0.94 0.93 0.92 0.91 0.90 0.89 Figure 3. Mean Interval Coverage by Sample Size for 95% Confidence Level CONCLUSION The macros TANGO and CORR_P facilitate the computation of confidence intervals for correlated proportions, making the estimation of these intervals easy for applied researchers. The macros are written in BASE SAS and do not require additional components for execution. The macros may be used in the present form or may be easily modified. For example, the data set CI produced in the macro CORR_P may be used with SAS/GRAPH to graphically illustrate the obtained confidence intervals. REFERENCES 10 50 100 500 1000 Sample size Estimation Method Original Wald Adjusted Wald Tango Bonett, D. G., & Price, R. M. (2012). Adjusted wald confidence interval for a difference of binomial proportions based on paired data. Journal of Educational and Behavioral Statistics, 37(4), 479-488. doi: 10.3102/ 1076998611411915 McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153-157. Newcombe, R. G. (1998). Improved confidence intervals for the difference between binomial proportions based on paired data. Statistics in Medicine, 17(22), 2635-2650. doi: 10.1002/(SICI)1097-0258(19981130)17:22<2635:: AID-SIM954>3.0.CO;2-C Tango, T. (1999). Improved confidence intervals for the difference between binomial proportions based on paired data by robert G. newcombe, statistics in medicine, 17, 2635-2650 (1998). Statistics in Medicine, 18(24), 3511-3513. doi: 10.1002/(SICI)1097-0258(19991230)18:24<3511::AID-SIM303>3.0.CO;2-A Tango, T. (1998). Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Statistics in Medicine, 17(8), 891-908. doi: 10.1002/(SICI)1097-0258(19980430)17:8<891::AID- SIM780>3.0.CO;2-B 8

CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the authors at: Patricia Rodriguez de Gil Jeanine Romano University of South Florida University of South Florida 4202 E. Fowler Ave., EDU 105 4202 E. Fowler Ave., EDU 105 Tampa, FL 33620 Tampa, FL 33620 Work Phone: 813 974-3220 Work Phone: 813 974-5012 Fax: 813 974-4495 Fax: 813 974-4495 prodrig6@usf.edu romano@usf.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 9