An Algorithm to Compute Exact Power of an Unordered RxC Contingency Table

NESUG 2007

Vivek Pradhan, Cytel Inc., Cambridge, MA
Stian Lydersen, Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Norway

ABSTRACT

Chi-square, Likelihood Ratio and Fisher-Freeman-Halton test statistics are used to test for association in an unordered r×c contingency table. Although asymptotically all these statistics follow a chi-square distribution, an exact conditional test based on the permutation distribution is recommended for small samples. The exact power computation of such tests involves a huge number of permutation tables, and as a result the computation becomes infeasible. The asymptotic power of these methods can be computed using a non-central chi-square distribution (see Agresti A, 2002: Categorical Data Analysis). However, to our knowledge, there is no algorithm or method available to compute the exact power. In this article we give an efficient algorithm and a SAS macro to compute exact power using the Chi-square, Likelihood Ratio and Fisher-Freeman-Halton test statistics. The SAS macro also reports exact power for the Mid-P and the randomized versions of the tests.

INTRODUCTION

The Chi-square (CH), Likelihood Ratio (LR), and Fisher-Freeman-Halton (FI) statistics are commonly used to test for association in an unordered r×c table. For an unordered r×c table with elements $x_{ij}$, i-th row sum $n_{i+}$, j-th column sum $n_{+j}$, and total sum $N$, these statistics are defined as follows:

$$CH(x) = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{\left(x_{ij} - n_{i+}n_{+j}/N\right)^{2}}{n_{i+}n_{+j}/N}$$

$$LR(x) = 2\sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij}\,\log\!\left(\frac{x_{ij}}{n_{i+}n_{+j}/N}\right)$$

$$FI(x) = -2\log\left(\gamma P(x)\right)$$

where $P(x)$ is the probability of the table $x$ under the null hypothesis, conditional on the margins, and

$$\gamma = (2\pi)^{(r-1)(c-1)/2}\; N^{-(rc-1)/2} \prod_{i=1}^{r} n_{i+}^{(c-1)/2} \prod_{j=1}^{c} n_{+j}^{(r-1)/2}$$

Under the null hypothesis the above three statistics asymptotically follow a chi-square distribution with (r-1)(c-1) degrees of freedom. The asymptotic inference may not be appropriate when the sample size N is small.
In such situations the inference derived from the exact permutation distribution is appropriate (Mehta and Patel (1983)). The exact test derived from the permutation distribution is very conservative in size (and hence in its p-value), and can be improved by subtracting half the probability of the observed statistic from the exact p-value. This form of p-value is known as the mid-p-value. Although the mid-p-value reduces the conservatism of the exact test, it does not always preserve the type I error (Lydersen and Laake (2003), Lydersen et al. (2005), Lydersen et al. (2007)). Another form of inference can be obtained by randomizing the test. The randomized version of a test always preserves the type I error.
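The mid-p adjustment described above can be illustrated on the smallest possible case. The following Python sketch is only an illustration under stated assumptions: it handles a 2×2 table with fixed margins, orders tables by their null probability (the Fisher-type ordering), and uses the hypothetical helper names `table_probs` and `exact_and_mid_p`, none of which come from the paper or its SAS macro.

```python
from math import comb, isclose

def table_probs(row1, row2, col1):
    """Null distribution of the top-left cell of a 2x2 table with fixed margins."""
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = comb(row1 + row2, col1)
    return {a: comb(row1, a) * comb(row2, col1 - a) / total
            for a in range(lo, hi + 1)}

def exact_and_mid_p(a_obs, row1, row2, col1):
    """Return (exact p-value, mid-p-value) under a probability ordering (assumption)."""
    probs = table_probs(row1, row2, col1)
    p_obs = probs[a_obs]
    # Exact p-value: total probability of tables no more likely than the observed one.
    p_exact = sum(p for p in probs.values() if p <= p_obs or isclose(p, p_obs))
    # Probability of the observed statistic; equally likely tables tie with it.
    p_point = sum(p for p in probs.values() if isclose(p, p_obs))
    # Mid-p: subtract half the probability of the observed statistic.
    return p_exact, p_exact - 0.5 * p_point

# Observed table [[3, 0], [0, 3]]: both row sums and the first column sum equal 3.
p_exact, p_mid = exact_and_mid_p(a_obs=3, row1=3, row2=3, col1=3)
print(p_exact, p_mid)
```

For this table the exact p-value is about 0.1 and the mid-p-value about 0.05, which shows how the adjustment can move a result across a 5% threshold.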

In a randomized test version, compute the next possible p-value lower than the one actually observed:

$$p_{next} = P\left(T(x) > T(n)\right) = P\left(T(x) \ge T(n)\right) - P\left(T(x) = T(n)\right)$$

Reject $H_0$ with probability

$$P(\text{reject } H_0 \mid n) = g\!\left(\frac{\alpha - p_{next}}{p_{value} - p_{next}}\right), \qquad g(t) = \begin{cases} 0 & \text{if } t < 0 \\ t & \text{if } 0 \le t \le 1 \\ 1 & \text{if } t > 1 \end{cases}$$

where T(n) is the observed test statistic and T(x) is any generic test statistic.

The asymptotic power computation to test for association in an r×c table can be done using a non-central chi-square distribution (see Agresti (2002)). This method may be applied when the sample size is large. However, for an r×c table with a small sample size, asymptotic inference should not be used, and the less conservative method using the mid-p-value is recommended (see Lydersen et al. (2007)). In the following sections we give an efficient method to compute the exact power of an unordered r×c table using the Monte Carlo method.

METHOD TO COMPUTE EXACT POWER OF AN RXC TABLE

Let $x_{ij}$, i = 1, ..., r, j = 1, ..., c, be the observed cell frequencies of an r×c table $n$, and let

$$\pi = \begin{pmatrix} \pi_{11} & \pi_{12} & \cdots & \pi_{1c} \\ \pi_{21} & \pi_{22} & \cdots & \pi_{2c} \\ \vdots & \vdots & & \vdots \\ \pi_{r1} & \pi_{r2} & \cdots & \pi_{rc} \end{pmatrix}$$

be the multinomial row probabilities under the alternative hypothesis $H_1$. Then the probability of such a table is given by:

$$P(n, \pi) = \prod_{i=1}^{r} \left( \frac{n_{i+}!}{\prod_{j=1}^{c} x_{ij}!} \prod_{j=1}^{c} \pi_{ij}^{x_{ij}} \right)$$

Then power is defined as the following:

$$\beta(\pi) = P(\text{reject } H_0 \mid \pi) = \sum_{n} P(\text{reject } H_0 \mid n)\, P(n; \pi)$$

The exact power computation first requires generating all possible tables (outcomes) whose total sample size is N. For a 2×2 table with fixed row sums, this is easily done. However, if the number of rows or columns is greater than 2, the number of possible outcomes explodes. For example, when the row sums are 20, the number of possible outcomes of a 2×2 table is 441, whereas for a 3×2 table this number is 9261 and for a 3×3 table it is 12,326,391. As a result the following combined exact and Monte Carlo method is adopted:

i. Generate an r×c table by taking r multinomial samples with distribution $Mult(n_{i+}; \pi_{i1}, \ldots, \pi_{ic})$ for each row, i = 1, ..., r.
ii. Compute β(n), the exact conditional inference for this table.
iii. Repeat steps i and ii M times, thereby generating β(1), β(2), ..., β(M).
iv. Finally, the exact power is the average of β(1), β(2), ..., β(M).

IMPLEMENTATION OF THE PROPOSED METHOD

The core problem of this exact power computation is to reduce the number of tables. To do this, first create a SAS dataset with M BY groups, where M is the number of Monte Carlo samples. Each BY group is an r×c table to be used for exact inference. Not all of the BY groups represent a unique contingency table. We used a SAS DATA step to find the distinct tables and their frequencies of occurrence in the following way:

1. For each r×c table, compute the column sums (row sums) and then order the columns (rows) by these sums. In this way one can bring all the smaller cell values to the upper left corner of the contingency table. For example, in the following table the column sums (2,6, 2) are ordered first and then the same is done for the row sums ( 5, 5, 5).
[Table: the example table before and after ordering its columns and rows]
2. Write all the cell values, left to right, in a single row. A single r×c table is thus represented by a single row.
3. Count the number of distinct rows (and therefore distinct tables) using the following logic in SAS:

Data <dataset>;
By by-variable;
If first.by-variable then count=0;
count+1;
if last.by-variable then output;

Once all the distinct tables and the corresponding counts of occurrences are found, call SAS's PROC FREQ with the specified by-variable to produce exact inferences (mainly the exact p-value and the corresponding point probability) for each BY group. Finally, steps ii-iv are applied, weighting each distinct table by its count.

AN EXAMPLE

The following example is inspired by the Oral data given in the StatXact PROCs User Manual.
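The dataset is a 3×9 table with the following cell counts:

$$x = \begin{pmatrix} 0 & 8 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\ 0 & 8 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}$$

The above example is well known for its sparseness. The asymptotic inference using the chi-square distribution gives a very high p-value; however, the inference using the exact method is significant. Consider the following probabilities under the alternative hypothesis: π =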

where =. The SAS program is run for simulations. The implemented algorithm first reduces the simulated tables to approximately 3253 distinct tables and then computes the powers. On a 2.66 GHz machine with 2 GB RAM, it takes less than 2 minutes and shows the following output:

              CH     LR     FI
Asymptotic    8%
Exact         98%    9%     96%
Exact-Midp    98%    95%    96%
Randomized    98%    96%    97%

Notice that all the powers using the exact methods are higher than that of the asymptotic method.

CONCLUSION

One may use full multinomial or Poisson sampling (Agresti 2002, Lydersen et al. (2007)) to do the Monte Carlo sampling. However, in this article we have used only product multinomial sampling. We feel that this kind of sampling is sufficient under this setting.

REFERENCES

Agresti A (2002). Introduction to Categorical Data Analysis. New York: Wiley.
Lancaster HO (1961). Significance tests in discrete distributions. Journal of the American Statistical Association 56.
Lydersen S, Laake P (2003). Power comparison of two-sided exact tests for association in 2×2 contingency tables using standard, mid p, and randomized test versions. Statistics in Medicine 22 (24).
Lydersen S, Pradhan V, Senchaudhuri P, Laake P (2005). Power comparison of two-sided exact tests for association in 2×2 contingency tables using standard, mid p, and randomized test versions. Journal of Statistical Computation and Simulation 75 (6).
Lydersen S, Pradhan V, Senchaudhuri P, Laake P (2007). Choice of test for association in small sample unordered r×c tables. Statistics in Medicine 26 (23).
Mehta CR, Patel NR (1983). A network algorithm for performing Fisher's exact test in r×c contingency tables. Journal of the American Statistical Association 78:427-434.
StatXact 8 PROCs User Manual (2007). An exact nonparametric inference for categorical data for SAS users. Cytel Inc., Cambridge, MA 02139.
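Before the full SAS macro, steps i-iv and the table-reduction idea can be sketched on a small scale in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, only the Pearson chi-square statistic with the standard exact p-value is covered, the exact p-value is found by brute-force enumeration of all tables with the observed margins (feasible only for tiny tables), and `canonical` is a simplified stand-in for the row and column ordering described in the implementation section.

```python
import random
from collections import Counter
from math import factorial, prod

def sample_table(row_totals, row_probs, rng):
    """Step i: one multinomial draw per row (product multinomial sampling)."""
    ncols = len(row_probs[0])
    table = []
    for n_i, p_i in zip(row_totals, row_probs):
        draws = Counter(rng.choices(range(ncols), weights=p_i, k=n_i))
        table.append(tuple(draws[j] for j in range(ncols)))
    return tuple(table)

def canonical(table):
    """Order columns by (sum, values), then rows by values (simplified ordering)."""
    cols = sorted(zip(*table), key=lambda col: (sum(col), col))
    return tuple(sorted(zip(*cols)))

def compositions(total, caps):
    """All tuples summing to `total` whose j-th entry is at most caps[j]."""
    if len(caps) == 1:
        if total <= caps[0]:
            yield (total,)
        return
    for first in range(min(total, caps[0]) + 1):
        for rest in compositions(total - first, caps[1:]):
            yield (first,) + rest

def tables_with_margins(row_sums, col_sums):
    """All non-negative integer tables with the given row and column sums."""
    if len(row_sums) == 1:
        yield (tuple(col_sums),)
        return
    for row in compositions(row_sums[0], col_sums):
        left = tuple(c - x for c, x in zip(col_sums, row))
        for rest in tables_with_margins(row_sums[1:], left):
            yield (row,) + rest

def pearson(table, row_sums, col_sums, n):
    """Pearson chi-square statistic; cells with zero expectation are skipped."""
    stat = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            e = row_sums[i] * col_sums[j] / n
            if e > 0:
                stat += (x - e) ** 2 / e
    return stat

def exact_pvalue(table):
    """Step ii: exact conditional p-value by full enumeration (tiny tables only)."""
    row_sums = tuple(sum(r) for r in table)
    col_sums = tuple(sum(c) for c in zip(*table))
    n = sum(row_sums)
    const = prod(map(factorial, row_sums)) * prod(map(factorial, col_sums)) / factorial(n)
    t_obs = pearson(table, row_sums, col_sums, n)
    p = 0.0
    for t in tables_with_margins(row_sums, col_sums):
        if pearson(t, row_sums, col_sums, n) >= t_obs - 1e-9:
            p += const / prod(factorial(x) for row in t for x in row)
    return p

def mc_power(row_totals, row_probs, m, alpha=0.05, seed=1):
    """Steps iii-iv: evaluate each distinct table once, weighted by its count."""
    rng = random.Random(seed)
    counts = Counter(canonical(sample_table(row_totals, row_probs, rng))
                     for _ in range(m))
    rejected = sum(k for tab, k in counts.items() if exact_pvalue(tab) <= alpha)
    return rejected / m, len(counts)

power, n_distinct = mc_power((3, 3), ((0.9, 0.1), (0.1, 0.9)), m=500, alpha=0.2)
print(power, n_distinct)
```

Because the test statistic and its conditional distribution do not change when rows or columns are permuted, two sampled tables that share a canonical form share a p-value, so the expensive exact computation runs once per distinct form rather than once per sample.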
SAS Program:

%macro tabgeneration(dataname=, number=, rowtotal=, alpha=);
%global nrows ncols;
/* getting the number of rows and columns from the input */
%let dsid = %sysfunc(open(&dataname));
%let nrows = %sysfunc(attrn(&dsid,nobs));
%let ncols = %sysfunc(attrn(&dsid,nvars));
%let rc = %sysfunc(close(&dsid));
/* preparing probabilities for multinomial sampling */
data input_; set &dataname; run;
%if &ncols > 2 %then %do;
  data input_; set &dataname;
  %do vr=1 %to &ncols;
    %if &vr=1 %then %do;
      varsum&vr=0;
      var_&vr=var&vr;
    %end;
    %else %do;

      /* conditional probability of column vr given the earlier columns */
      varsum&vr=varsum%eval(&vr-1)+var%eval(&vr-1);
      var_&vr=var&vr/(1-varsum&vr);
    %end;
  %end;
  keep var_1-var_&ncols;
  run;
  data input_; set input_;
    array a[*] var_1-var_&ncols;
    do i=1 to dim(a);
      if a[i]>1 then a[i]=1;
    end;
    drop i;
  run;
%end;
/* end of data preparation */
proc transpose data=input_ out=transpose; run;
proc sql noprint;
  %do ii=1 %to %sysevalf(&nrows);
    select col&ii into :pi&ii separated by ' ' from transpose;
  %end;
quit;
/* creating a dataset with total byvar = # of samples, based on product multinomial sampling */
data test;
  ntables=%sysevalf(&number);
  do tabno=1 to ntables;
    nrows=%sysevalf(&nrows);
    ncols=%sysevalf(&ncols);
    do row=1 to nrows;
      %do iii=1 %to %sysevalf(&nrows);
        if row=%eval(&iii) then rowsum=%scan(&rowtotal,&iii);
      %end;
      col=1;
      do while(col<=ncols-1 and rowsum>0);
        %do ii=1 %to %sysevalf(&nrows);
          if row=%eval(&ii) then pi=scan("&&pi&ii",col,' ');
        %end;
        if pi= then pi=.; else if pi= then pi= ;
        wgt=ranbin(,rowsum,pi);
        output;
        rowsum=rowsum-wgt;
        col=col+1;
      end;
      if (rowsum>0) then do;
        wgt=rowsum;
        col=ncols;
        output;
      end;
    end;
  end;
  keep tabno row col wgt;
run;
/* end of data creation based on product multinomial sampling */
/********************************************************************************/
/* Sorting the tables by cell values, bringing all the 0's to the upper left corner */
/* finding column sums */
data test; set test; format wgt z2.; run;
proc sort data=test out=out2; by tabno col; run;
data out3; set out2; by tabno col;
  if first.col then colsum=0;
  colsum+wgt;
  if last.col then output;
run;
proc sort data=out3; by tabno colsum; run;
data out; set out3; by tabno;
  if first.tabno then col1=0;
  col1+1;
  keep tabno col col1;
run;

proc sort data=out; by tabno col; run;
proc sort data=test; by tabno col; run;
data mr; merge test out; by tabno col; run;
proc sort data=mr; by tabno row col; run;
/* arranging the rows */
proc transpose data=mr out=out_; id col1; var wgt; by tabno row; run;
data out_; set out_;
  ord=catt(of _1-_%sysevalf(&ncols)); /* number of columns */
run;
proc sort data=out_ out=out_2; by tabno ord; run;
data out_3; set out_2; by tabno;
  if first.tabno then row1=0;
  row1+1;
  drop row _name_ ord;
run;
proc transpose data=out_3 out=out_; by tabno row1; run;
data out_5; set out_;
  array a[*] col1;
  do i=1 to dim(a);
    if a[i]=. then a[i]=0;
  end;
  drop i;
run;
/******************************************************************************/
/* Finding distinct tables with their total frequencies */
data two; set out_5; by tabno;
  length yy $;
  if first.tabno then yy=put(col1,z2.);
  else yy=trim(yy)||','||put(col1,z2.);
  if last.tabno then output;
  drop col1;
  retain yy;
  keep tabno yy;
run;
proc sort data=two; by yy; run;
data thr; set two; by yy;
  retain count;
  array ids(%eval(&number));
  if first.yy then do;
    count=0;
    do i=1 to dim(ids);
      ids(i)=.;
    end;
  end;
  count+1;
  ids(count)=tabno;
  if last.yy then output;
  drop tabno i;
  retain _all_;
  keep count ids1;
run;
proc sort data=thr; by ids1; run;
data abc_; merge test thr(rename=(ids1=tabno)); by tabno;
  if count=. then delete;
run;
/* Computing the exact p-values and the corresponding point probabilities */

proc freq data=abc_ noprint;
  by tabno;
  tables row*col / out=out outpercent nowarn;
  weight wgt;
  exact pchi lrchi fisher / point;
  output out=output_ chisq;
run;
data out3;
  merge out(where=(pct_row=)) output_;
  by tabno;
  if first.tabno then output;
  keep tabno P_PCHI xp2_fish xpt_fish xp_lrchi xpt_lrch xp_pchi xpt_pchi;
run;
/* taking care of the situation when the input table is reduced to an n x 1 table */
data out3; set out3;
  array a[*] p_pchi xp2_fish xpt_fish xp_lrchi xpt_lrch xp_pchi xpt_pchi;
  do i=1 to dim(a);
    if a[i]=. then a[i]=1;
  end;
  drop i;
run;
data out3;
  merge out3 thr(rename=(ids1=tabno));
  by tabno;
  pthalf_ch=0.5*xpt_pchi;
  pthalf_fi=0.5*xpt_fish;
  pthalf_lr=0.5*xpt_lrch;
  /* computing mid-p values */
  midpval_ch=xp_pchi-pthalf_ch;
  midpval_fi=xp2_fish-pthalf_fi;
  midpval_lr=xp_lrchi-pthalf_lr;
  /* calculating g(t) of the randomized test version */
  rndpval_ch=(min(max(0,(&alpha-(xp_pchi-2*pthalf_ch))/(xp_pchi-(xp_pchi-2*pthalf_ch))),1));
  rndpval_fi=(min(max(0,(&alpha-(xp2_fish-2*pthalf_fi))/(xp2_fish-(xp2_fish-2*pthalf_fi))),1));
  rndpval_lr=(min(max(0,(&alpha-(xp_lrchi-2*pthalf_lr))/(xp_lrchi-(xp_lrchi-2*pthalf_lr))),1));
  /* computing flags for power computation */
  if P_PCHI<=&alpha then as_ch=1; else as_ch=0;
  /* Computing for chi-square statistic */
  if xp_pchi<=&alpha then stdflag_ch=1; else stdflag_ch=0;
  if midpval_ch<=&alpha then midpflag_ch=1; else midpflag_ch=0;
  totas=count*as_ch;
  totstd_ch=count*stdflag_ch;
  totmidp_ch=count*midpflag_ch;
  totrnd_ch=count*rndpval_ch;
  /* Computing for Fisher statistic */
  if xp2_fish<=&alpha then stdflag_fi=1; else stdflag_fi=0;
  if midpval_fi<=&alpha then midpflag_fi=1; else midpflag_fi=0;
  totstd_fi=count*stdflag_fi;
  totmidp_fi=count*midpflag_fi;
  totrnd_fi=count*rndpval_fi;
  /* Computing for Likelihood-ratio statistic */
  if xp_lrchi<=&alpha then stdflag_lr=1; else stdflag_lr=0;
  if midpval_lr<=&alpha then midpflag_lr=1; else midpflag_lr=0;
  totstd_lr=count*stdflag_lr;
  totmidp_lr=count*midpflag_lr;
  totrnd_lr=count*rndpval_lr;
run;
/* computing the powers */
proc sql noprint;
  create table power_ as select
    sum(totas)/&number as ASCHPOW,
    sum(totstd_ch)/&number as CHIPOW_STD,
    sum(totmidp_ch)/&number as CHIPOW_MIDP,
    (sum(totrnd_ch))/&number as CHIPOW_RND,
    sum(totstd_fi)/&number as FIPOW_STD,
    sum(totmidp_fi)/&number as FIPOW_MIDP,
    (sum(totrnd_fi))/&number as FIPOW_RND,
    sum(totstd_lr)/&number as LRPOW_STD,
    sum(totmidp_lr)/&number as LRPOW_MIDP,
    /* randomized LR power uses the LR totals (totrnd_lr), not the chi-square ones */
    (sum(totrnd_lr))/&number as LRPOW_RND
  from out3;
quit;
proc transpose data=power_ out=final_; run;
proc format;
  value $name
    'ASCHPOW'='Asymptotic Chi-square'
    'CHIPOW_STD'='Chi-square with Exact p'
    'CHIPOW_MIDP'='Chi_square with Midp'
    'CHIPOW_RND'='Chi_square with Randomized'
    'FIPOW_STD'='Fisher with Exact p'
    'FIPOW_MIDP'='Fisher with Midp'
    'FIPOW_RND'='Fisher with Randomized'
    'LRPOW_STD'='Likelihood-ratio with Exact p'
    'LRPOW_MIDP'='Likelihood-ratio with Midp'
    'LRPOW_RND'='Likelihood-ratio with Randomized';
run;

proc print data=final_(rename=(_name_=method COL1=Power)) noobs;
  title "***************************************************************************************";
  title2 "* Exact Power of a rxc table using different methods *";
  title3 "***************************************************************************************";
  format Method $name.;
run;
%mend;

/* observed proportions from oral data */
data one;
  input var1-var9;
  cards;
;
options nonotes;
%tabgeneration(dataname=one, number=, rowtotal=10 7 10, alpha=.05);

CONTACT INFORMATION

Please send your comments or further inquiries to:
Vivek Pradhan, Cytel Inc., Cambridge, MA 02139, USA
Email: vpradhan@cytel.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
