Two useful macros to nudge SAS to serve you

David Izrael, Michael P. Battaglia, Abt Associates Inc., Cambridge, MA

Abstract

This paper offers two macros that augment the power of two SAS procedures: LOGISTIC and UNIVARIATE. PROC LOGISTIC calculates, among other statistics, several measures that reflect the predictive ability of a logistic regression model: the percentages of concordant, discordant, and tied pairs, as well as four rank correlation indexes (Somers' D, Gamma, Tau-a, and c). The procedure displays them in the Association of Predicted Probabilities and Observed Responses table. In the presence of survey weights, however, the procedure computes those measures ignoring the weights. This makes it difficult for survey researchers to use PROC LOGISTIC to assess the predictive ability of a model, because survey weights are commonly used to analyze survey data. The first macro we offer takes the survey weights into account when computing the association measures and compares the unweighted measures with the ones calculated by the macro.

PROC UNIVARIATE provides five methods for computing quantile statistics. However, these may not be enough if a researcher wants to match SAS statistical computations with those from other statistical packages, or to use SAS to reproduce statistical computations done in another package. For instance, S-PLUS computes quantiles using a different approach. Our second macro computes quantiles following the algorithm used in S-PLUS and compares its results with the respective quantiles produced by PROC UNIVARIATE.

Macro I: to Compute the Weighted Association of Predicted Probabilities and Observed Responses Table

Introduction

The Association of Predicted Probabilities and Observed Responses table lists several measures of association to help a researcher assess the quality of a logistic model. PROC LOGISTIC computes the percentages of concordant, discordant, and tied observations and the number of observation pairs upon which the percentages are based [1]. If a response variable is set to 1 in case of event and 0 in case of non-event, then for all pairs of observations with different values of the response variable, a pair is concordant if the event observation has a higher predicted probability than the non-event observation; a pair is discordant if the event observation has a lower predicted probability than the non-event observation; and if the predicted probabilities are equal for a pair, it is a tie [2]. PROC LOGISTIC computes the percentages of concordant, discordant, and tied pairs along with the total number of pairs. The four rank correlation indexes in the table are computed from the numbers of concordant and discordant pairs of observations by the following formulae:

   Somers' D = (nc - nd) / t                   (1)
   Gamma     = (nc - nd) / (nc + nd)           (2)
   Tau-a     = (nc - nd) / (0.5 N(N - 1))      (3)
   c         = (nc + 0.5(t - nc - nd)) / t     (4)

where N is the total number of observations in the input data set, t is the total number of pairs with different response values, nc is the number of concordant pairs, and nd is the number of discordant pairs [1]. In a relative sense, a model with higher values for these indexes has better predictive ability than a model with lower values for these indexes [2].
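As a quick numerical illustration of formulae (1)-(4), the short data step below evaluates the four indexes for made-up pair counts; the values of nc, nd, t, and N are assumptions invented for the example, not output of any procedure.

/* Hypothetical counts: 100 observations, 60 events and 40 non-events, */
/* so t = 60*40 = 2400 pairs with different response values.           */
data _rank_check;
  nc = 1400;                                /* concordant pairs         */
  nd =  700;                                /* discordant pairs         */
  t  = 2400;                                /* total pairs (incl. ties) */
  N  =  100;                                /* observations             */
  Somers_D = (nc - nd) / t;                 /* formula (1) -> 0.2917    */
  Gamma    = (nc - nd) / (nc + nd);         /* formula (2) -> 0.3333    */
  Tau_a    = (nc - nd) / (0.5*N*(N - 1));   /* formula (3) -> 0.1414    */
  c        = (nc + 0.5*(t - nc - nd)) / t;  /* formula (4) -> 0.6458    */
  put Somers_D= Gamma= Tau_a= c=;
run;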
It turns out, however, that in the presence of a survey weight the LOGISTIC procedure does not work as expected with regard to the computation of the association measures. To test this, we fitted data from an actual survey to a model with just two predictors, in both unweighted and weighted cases. To obtain more detail than the rounded results, we extracted the calculated measures using ODS:

ods listing close;
ods output Association=assocu;
proc logistic descending data=analytic;
  class indep1 indep2;
  model response = indep1 indep2;
run;

ods listing;
proc print data=assocu noobs;
  title3 "Unweighted Association Measures";
run;

ods listing close;
ods output Association=assocw;
proc logistic descending data=analytic;
  class indep1 indep2;
  model response = indep1 indep2;
  weight wgt / norm;
run;

ods listing;
proc print data=assocw noobs;
  title3 "Weighted Association Measures";
run;

The following output shows a complete identity of the weighted and unweighted measures, which casts doubt upon the procedure's ability to correctly compute the association measures in the presence of survey weights.

Unweighted Association Measures

Label1               cValue1     nValue1     Label2       cValue2   nValue2
Percent Concordant   50.6        50.6048     Somers' D    0.178     0.1777
Percent Discordant   32.8        32.8271     Gamma        0.213     0.2130
Percent Tied         16.6        16.567      Tau-a        0.054     0.0544
Pairs                77623704    77623704    c            0.589     0.5889

Weighted Association Measures

Label1               cValue1     nValue1     Label2       cValue2   nValue2
Percent Concordant   50.6        50.6048     Somers' D    0.178     0.1777
Percent Discordant   32.8        32.8271     Gamma        0.213     0.2130
Percent Tied         16.6        16.567      Tau-a        0.054     0.0544
Pairs                77623704    77623704    c            0.589     0.5889

Although a certain difference between the unweighted and weighted measures emerges as the number of predictors in the model increases, the official weighted measures are by no means what we could expect and use for model assessment. We offer here the macro WTAPPOR that does take survey weights into account.

Macro WTAPPOR

The macro uses the same formulae (1)-(4) but in a weighted form. Let the number of event responses in a sample be E and the number of non-event responses be N. The total unweighted number of pairs being considered is E*N. Let us consider the ij-th pair of observations, and let the weight and the predicted probability for the event observation (response is 1) be w_i and p_hat_i, and for the non-event observation (response is 0) be w_j and p_hat_j. If p_hat_i is greater than p_hat_j then, following the definition given in the introduction, the pair is concordant and its weighted representation w_i * w_j is added to the weighted total of concordant pairs. In the same vein, if p_hat_i is lower than p_hat_j, the pair is discordant and its weighted representation w_i * w_j is added to the weighted total of discordant pairs. Finally, if the pair is neither concordant nor discordant, the product w_i * w_j is added to the weighted total of tied pairs. Denoting W_E as the total weighted number of event responses and W_N as the total weighted number of non-event responses, the total weighted number of pairs is calculated as W_E*W_N. Based upon the weighted totals accumulated after E*N iterations, the macro calculates the respective percentages and the correlation indexes by formulae (1)-(4).
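The same accumulation can also be sketched in a single set-based step. The PROC SQL fragment below is only an illustration of the idea, not part of WTAPPOR; the data set probs and the variables response, p_hat, and w (the normalized weight) are assumed names.

/* Sketch only: weighted concordance via a cross join of the event  */
/* and non-event observations. All names here are assumptions. The  */
/* cross join materializes E*N pairs, so it can be slow on large    */
/* samples.                                                         */
proc sql;
  create table _wassoc as
  select sum((e.p_hat > n.p_hat) * e.w * n.w) as wt_concord,
         sum((e.p_hat < n.p_hat) * e.w * n.w) as wt_discord,
         sum((e.p_hat = n.p_hat) * e.w * n.w) as wt_tie
  from probs(where=(response=1)) as e,
       probs(where=(response=0)) as n;
quit;

The three sums add up to W_E*W_N, the weighted number of pairs, and can be plugged directly into the weighted counterparts of formulae (1)-(4).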

The macro reports the correctly calculated weighted measures immediately after the official Association of Predicted Probabilities and Observed Responses table. Exhibit 1 demonstrates the beginning and the end of the listing of the macro run over the survey data set. The logistic model used in the example has 12 categorical independent variables expl1-expl12, dependent variable effect (1,0), and weight wgt, and the macro is called by the following statement:

%wtappor (ds = survey, outds = , weight = wgt, model = expl1-expl12, depvar = effect);

As may be seen from Exhibit 1, there are measurable differences between the official measures and those calculated by the macro WTAPPOR. Note that the official weighted number of pairs, 77131560, is the product of the unweighted (E*N) frequencies of event and non-event responses, 18106 and 4260 respectively, whereas the weighted number of pairs calculated and used by the macro, 79632519, is the product of the total normalized weights (W_E*W_N) for the event and non-event sets, 17922.953 and 4443.047 respectively. The macro itself is presented in Exhibit 2. It is well commented and easy to use.

Exhibit 1.

The LOGISTIC Procedure

Model Information
Data Set                      WORK.ANALYTIC
Response Variable             effect (Positive Effect)
Number of Response Levels     2
Number of Observations        22366
Weight Variable               wgt (Final Weight)
Sum of Weights                22366
Link Function                 Logit
Optimization Technique        Fisher's scoring

Response Profile
Ordered                  Total        Total
  Value    effect    Frequency       Weight
      1         1        18106    17922.953
      2         0         4260     4443.047

NOTE: Weights are normalized to the actual sample size.
...................................................

Official table:

Association of Predicted Probabilities and Observed Responses
Percent Concordant     64.8      Somers' D    0.302
Percent Discordant     34.6      Gamma        0.304
Percent Tied            0.6      Tau-a        0.093
Pairs              77131560      c            0.651

Table calculated by the macro:

Association of Predicted Probabilities and Observed Responses
using normalized weight WGT
Weighted Percent Concordant     66.6      Weighted Somers' D    0.333
Weighted Percent Discordant     33.3      Weighted Gamma        0.333
Weighted Percent Tied            0.1      Weighted Tau-a        0.106
Weighted Pairs              79632519      Weighted c            0.666
......................

Exhibit 2. Macro WTAPPOR

%macro WTAPPOR (
  ds     =,   /* INPUT DATA SET                                     */
  outds  =,   /* OUTPUT DATA SET WITH MEASURES; IF BLANK, JUST
                 DISPLAY THE RESULT                                 */
  weight =,   /* SURVEY WEIGHT                                      */
  model  =,   /* STRING WITH EXPLANATORY VARS; ALL MUST BE
                 CATEGORICAL                                        */
  depvar =    /* DEPENDENT VARIABLE, 1-EVENT, 0-NON-EVENT           */
);

/*** FIT DATA BY LOGISTIC MODEL TO GET PREDICTED PROBABILITIES ***/
proc logistic descending data=&ds;
  weight &weight. / norm;   /* USING NORMALIZED WEIGHT */
  class &model;
  model &depvar = &model;
  output out=_probs(keep=&depvar &weight _p_hat) predicted=_p_hat;
run;

proc sql noprint;
  /* TOTAL WEIGHTED NUMBER OF RECORDS */
  select sum(&weight) into :_tot_wgt from _probs;
  /* TOTAL UNWEIGHTED NUMBER OF RECORDS */
  select count(*) into :_tot_unw from _probs;
  /* TOTAL UNWEIGHTED NUMBER OF NON-EVENTS */
  select count(*) into :_tot_nev from _probs where &depvar=0;
quit;

proc summary data=_probs noprint nway;
  var &weight;
  output out=_out sum=_sumw0;
run;

/* NORMALIZE WEIGHT AND SPLIT INTO EVENT AND NON-EVENT SETS */
data _probs1(rename=(_p_hat=_p_hat1 &weight=_w1)            /* EVENT DATA SET     */
             keep=_p_hat &weight _concord _discord _tie)
     _probs0(rename=(_p_hat=_p_hat0 &weight=_w0)            /* NON-EVENT DATA SET */
             keep=_p_hat &weight);
  set _probs;
  if _n_=1 then set _out;
  &weight = &weight.*&_tot_unw./_sumw0;   /* NORMALIZATION       */
  _concord=0; _discord=0; _tie=0;         /* INITIALIZE MEASURES */
  if &depvar=1 then output _probs1;
  else output _probs0;
run;

/* WEIGHTED TOTAL OF EVENTS */
proc summary data=_probs1 noprint nway;
  var _w1;
  output out=_total1 sum=_total1;
run;

/* WEIGHTED TOTAL OF NON-EVENTS */
proc summary data=_probs0 noprint nway;
  var _w0;
  output out=_total0 sum=_total0;
run;

data _total;   /* DATA SET WITH WEIGHTED TOTALS */
  merge _total1 _total0;
  _total_p = _total1*_total0;
run;

%macro cummsr;
  /* COMPARE EACH EVENT OBSERVATION WITH EACH NON-EVENT OBSERVATION */
  %do _i=1 %to &_tot_nev;
    data _probs1;
      set _probs1;
      if _n_=1 then set _probs0(firstobs=&_i obs=&_i);
      if _p_hat1 < _p_hat0 then
        _discord=_discord+_w0*_w1;          /* ACCRUE DISCORDANCE */
      else if _p_hat1 > _p_hat0 then
        _concord=_concord+_w0*_w1;          /* ACCRUE CONCORDANCE */
      else _tie=_tie+_w0*_w1;               /* ACCRUE TIES        */
      drop _p_hat0 _w0;
    run;
  %end;
%mend cummsr;
%cummsr;

/* SUM CONCORDANCE, DISCORDANCE, AND TIES THROUGH THE WHOLE DATA SET */
proc summary data=_probs1 noprint nway;
  var _concord _discord _tie;
  output out=_out sum=_concord _discord _tie;
run;

/* CALCULATION OF PERCENTAGES AND MEASURES BY FORMULAE (1)-(4) */
data &outds _out(keep=wgt_:);
  merge _out _total;
  Wgt_Percent_Concordant = round(_concord*100/_total_p, .01);
  Wgt_Percent_Discordant = round(_discord*100/_total_p, .01);
  Wgt_Percent_Tied       = round(_tie*100/_total_p, .01);
  Wgt_Pairs    = _total_p;
  Wgt_Somers_D = (_concord - _discord) / _total_p;
  Wgt_Gamma    = (_concord - _discord) / (_concord + _discord);
  Wgt_Tau_a    = (_concord - _discord) / (.5*&_tot_unw.*(&_tot_unw - 1));
  Wgt_c        = (_concord + .5*(_total_p - _concord - _discord)) / _total_p;
run;

/* DISPLAY RESULTS AFTER THE OFFICIAL TABLE */
data _null_;
  set _out;
  file print ls=80 ps=59;
  put "Association of Predicted Probabilities and Observed Responses";
  put "using normalized weight &weight";
  put;
  put "Weighted Percent Concordant " Wgt_Percent_Concordant 5.2
      "   Weighted Somers' D " Wgt_Somers_D 6.4;
  put "Weighted Percent Discordant " Wgt_Percent_Discordant 5.2
      "   Weighted Gamma     " Wgt_Gamma 6.4;
  put "Weighted Percent Tied       " Wgt_Percent_Tied 5.2
      "   Weighted Tau-a     " Wgt_Tau_a 6.4;
  put "Weighted Pairs         " Wgt_Pairs 10.
      "   Weighted c         " Wgt_c 6.4;
run;

%mend wtappor;

Summary

The presented macro, WTAPPOR, is a valuable instrument for a survey researcher to assess the quality of a logistic model when survey weights are present. The macro gives appreciably different measures of association from those calculated by PROC LOGISTIC.

Macro II: Are five methods to compute quantiles enough? If not, get a sixth one.

Introduction

The reader will remember that, using the PCTLDEF= option of PROC UNIVARIATE, one can specify one of five methods for computing quantile statistics.
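For instance, the step below requests quartiles under the fourth definition; the data set work.x and the analysis variable v are assumed names used only for illustration.

/* Illustration only: quartiles of an assumed variable v under        */
/* percentile definition 4; change PCTLDEF= to any of 1-5 to compare. */
proc univariate data=work.x pctldef=4 noprint;
  var v;
  output out=_pct pctlpre=P_ pctlpts=0 25 50 75 100;
run;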

Following the definitions in [3], let n be the number of nonmissing values of a variable and let x_1, ..., x_n represent its ordered values. For the t-th percentile, let p = t/100. For definitions 1, 2, 3, and 5 below, let np = j + g, where j is the integer part and g is the fractional part of np. For definition 4, let (n+1)p = j + g. Then the t-th percentile, y, is defined as follows:

PCTLDEF=1 (weighted average at x_np):
   y = (1 - g) x_j + g x_(j+1), where x_0 is taken to be x_1.

PCTLDEF=2 (observation numbered closest to np):
   y = x_i, where i is the integer part of np + 1/2, if g is not equal to 1/2;
   if g = 1/2, then y = x_j if j is even, or y = x_(j+1) if j is odd.

PCTLDEF=3 (empirical distribution function):
   y = x_j if g = 0; y = x_(j+1) if g > 0.

PCTLDEF=4 (weighted average aimed at x_(p(n+1))):
   y = (1 - g) x_j + g x_(j+1), where x_(n+1) is taken to be x_n.

PCTLDEF=5 (empirical distribution function with averaging):
   y = (x_j + x_(j+1)) / 2 if g = 0; y = x_(j+1) if g > 0.

Researchers often need to match results obtained in SAS with those given by another statistical package, or to reproduce in SAS statistical computations done in another package. If quantiles are involved in those computations, the match may fail because the other package may compute quantiles differently. For example, S-PLUS uses the function quantile(x, p), which computes quantiles at specified probabilities by linear interpolation, using the formula

   quantile(x, p) = [1 - (p(n-1) - floor(p(n-1)))] * x_(1+floor(p(n-1)))
                    + (p(n-1) - floor(p(n-1))) * x_(2+floor(p(n-1)))        (5)

where x_1, ..., x_n is the ordered sample, p is the specified probability, and floor() denotes the floor, or integer part, of its argument [4]. The result of quantile(x, p) will generally not be identical to any of the five methods described above.
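Before turning to the full macro, a minimal single-probability sketch of formula (5) may make the computation easier to follow. The data set work.x, the variable v, and the probability 0.25 are assumptions for illustration; the variable is assumed to have no missing values.

%let p = 0.25;                       /* assumed probability           */

proc sort data=work.x out=_srt;      /* order the sample ascending    */
  by v;
run;

data _null_;
  set _srt end=_last nobs=n;
  retain _lo _hi;
  _j = floor(&p*(n-1));              /* integer part of p(n-1)        */
  _g = &p*(n-1) - _j;                /* fractional part of p(n-1)     */
  if _n_ = _j+1 then _lo = v;        /* x indexed 1+floor(p(n-1))     */
  if _n_ = _j+2 then _hi = v;        /* x indexed 2+floor(p(n-1))     */
  if _last then do;
    if _g = 0 then _hi = _lo;        /* p(n-1) integral: one point    */
    quantile = (1-_g)*_lo + _g*_hi;  /* formula (5)                   */
    put "S-PLUS-style quantile at p=&p: " quantile;
  end;
run;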

Below, we present the macro QUANT6SP, which computes S-PLUS-like quantiles by formula (5), and we compare its results with those obtained by the five methods of PROC UNIVARIATE.

Macro QUANT6SP

The macro presented below is richly supplied with comments and is easy to use.

%macro quant6sp (
  inds =,   /* input data set with the variable of interest          */
  var  =,   /* variable upon which to compute quantiles              */
  ncell=,   /* number of cells whose boundaries are determined by
               the quantiles; 4 for quartiles                        */
  prfx =,   /* prefix we want for the quantile variables             */
  outds=    /* output data set with the quantiles                    */
);

%let _step = %sysevalf(1/&ncell);   /* step between quantile boundaries */

/* create string with the boundaries of the quantiles */
data _temp;
  %macro stq;
    f = "0 "
    %do i=1 %to %eval(&ncell-1);
      || "%sysevalf(&i*&_step)" || ' '
    %end;
    || " 1";
  %mend stq;
  %stq;
run;

data _null_;   /* create macro variable with the string of boundaries */
  set _temp;
  call symput('pctl', left(f));
run;
%put BOUNDARIES OF QUANTILES: &pctl;

proc sort data=&inds(keep=&var) out=_i;   /* order variable ascending */
  by &var;
run;

data _null_;   /* count the records, i.e., the values of the variable */
  set _i end=fin;
  retain _n;
  _n+1;
  if fin then call symput('totn', left(_n));
run;

%do l=1 %to %eval(&ncell+1);
  /* retrieve each boundary and put it into its own macro variable */
  %let p&l = %scan(&pctl, &l, %str( ));
%end;

data &outds(keep=&prfx.:);
  set _i end=_fin;
  retain
  %do j=1 %to %eval(&ncell+1);
    _less&j _greater&j 0
  %end;
  ;
  /* retrieve the values of the variable entering formula (5) */
  %do j=1 %to %eval(&ncell+1);
    if _n_ = 1+floor(%sysevalf(&&p&j*(&totn-1))) then _less&j=&var;
    if _n_ = 2+floor(%sysevalf(&&p&j*(&totn-1))) then _greater&j=&var;
  %end;
  if _fin then do;
    /* compute formula (5) for all boundaries in one pass through the data */
    %do j=1 %to %eval(&ncell+1);
      &prfx&j = (1-(%sysevalf(&&p&j*(&totn-1))
                - floor(%sysevalf(&&p&j*(&totn-1)))))*_less&j
                + (%sysevalf(&&p&j*(&totn-1))
                - floor(%sysevalf(&&p&j*(&totn-1))))*_greater&j;
    %end;
    output;
  end;
run;

proc print;
run;

%mend quant6sp;

Results

Here is an example of a QUANT6SP call that breaks down predicted probabilities into quartiles:

%quant6sp (inds = probs, var = probabs, ncell = 4, outds = out, prfx = method6);

The computed quartiles are shown below:

   0%        25%       50%       75%       100%
0.018626   0.58884   0.65698   0.71963   0.79855

Applying each of the five methods of PROC UNIVARIATE to the same variable probabs, we obtain the following table:

PCTLDEF      0%        25%       50%       75%       100%
   1      0.018626   0.58633   0.65676   0.71951   0.79855
   2      0.018626   0.58633   0.65676   0.71951   0.79855
   3      0.018626   0.58633   0.65676   0.71951   0.79855
   4      0.018626   0.58717   0.65698   0.71987   0.79855
   5      0.018626   0.58800   0.65698   0.71975   0.79855

As shown, none of the five sets of quartiles is identical to the results obtained with the macro QUANT6SP.

References

1. SAS Institute Inc. (1999). SAS/STAT User's Guide, Version 8, Chapter 39. Cary, NC: SAS Institute Inc.
2. SAS Institute Inc. (1995). Logistic Regression Examples Using the SAS System. Cary, NC: SAS Institute Inc.
3. SAS Institute Inc. (1999). SAS Procedures Guide: The UNIVARIATE Procedure. Cary, NC: SAS Institute Inc.
4. Venables, W.N., and Ripley, B.D. (2000). Modern Applied Statistics with S-PLUS. New York: Springer-Verlag.

Contact Information

David Izrael
Abt Associates Inc.
Cambridge, MA 02338
Tel: (617) 349-2434
E-mail: david_izrael@abtassoc.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.