USING MACROS TO CREATE PARAMETER DRIVEN PROCEDURES THAT SUMMARIZE AND PRESENT STATISTICAL OUTPUT IN TABULAR FORM

John A. Wenston
National Development and Research Institutes, Inc.

INTRODUCTION

While SAS has a wide range of powerful statistical procedures, finding relevant statistics, particularly when doing exploratory analysis, and presenting them in an easy-to-read format often involves wading through reams of output and constructing summary tables manually. Moreover, some procedures do not provide certain measures that are important to users in evaluating results. This paper demonstrates how to use the SAS macro facility to create parameter-driven procedures that summarize and present statistical output in tabular form. The strategy employed is to write procedures that "capture" key statistics and place them in a SAS data set, which can then be summarized using DATA steps and PROCs, and printed using PUT statements, PROC PRINT, or PROC REPORT.

There are two ways of getting statistics into a SAS data set. The first is to run statistical PROCs and direct the printed output to a sequential data file with PROC PRINTTO. A SAS DATA step can then read this file, pick up key statistics, and put them into a SAS data set. The second is to use the OUTPUT and OUTEST data set options provided by most SAS statistical procedures. This paper demonstrates both methods, using PROC LOGISTIC.

The database used in the examples is from a NIDA-funded cross-sectional survey research project, Risk Factors for AIDS Among Intravenous Drug Users (NIDA # R01 DA3574, Don C. DesJarlais, Principal Investigator). The database contains information from structured questionnaires and blood tests, and includes data on HIV status, demographics, risk behaviors and service utilization. Each observation contains approximately 900 variables, and there are over 2000 respondents enrolled in the study.

THE BASIC REPORT

Figure 1 is an example of a report based on PROC LOGISTIC.
The report summarizes the results of three separate bivariate logistic regressions. The dependent variable is HIV status (0 = negative, 1 = positive) and the predictor variables are SPDBALL (whether the respondent injects speedball, a mixture of heroin and cocaine), CRACK (whether the respondent smokes crack), and INJ4_DAY (whether the respondent injects 4+ times per day). For brevity's sake only three variables are included in the report. More commonly a report will summarize the results of 30 or 40 bivariate regressions.

[Figure 1: Sample report. Titles: "MACRO EXAMPLE", "Parameters and Odds Ratio Estimates Using Proc Logistic", "Each Variable Was Entered in a Separate Equation"; BY line: Dependent Variable=HIV. Columns: Predictor Variable, N for Event=0, N for Event=1, Beta, Std Err, Wald Chi Square, Odds Ratio, 95% CI Lower Bound, 95% CI Upper Bound, p(chi). Rows: SPDBALL, CRACK, INJ4_DAY. The numeric values are not recoverable from the source.]

The report macro uses another macro, MPARSE, that takes a list of variables as an argument (each variable must be separated by at least one space) and returns macro variables in the form &X1, &X2, ..., &Xn, where &X1 contains the name of the first variable in the list, &X2 the second, etc. In addition, MPARSE creates a variable, &NVARS, that contains the number of variables in the list. The code for MPARSE is:

    %macro mparse(x=);
      /* make &x1-&x100 and &nvars visible to the calling macro */
      %do i=1 %to 100;
        %global x&i;
      %end;
      %global nvars;
      %let nvars=0;          /* stays 0 if the list is empty */
      %let i=1;
      %let z=init;
      /* stop at the end of the list or after 100 names */
      %do %while (&i <= 100 and &z ne );
        %let z=%scan(&x,&i);
        %if &z ne %then %do;
          %let x&i = &z;
          %let nvars=&i;
        %end;
        %let i = %eval(&i+1);
      %end;
    %mend mparse;

As an example, the macro call

    %mparse(x=a b c);

would produce &X1, &X2 and &X3, equal to a, b, and c respectively, as well as &NVARS, equal to 3. The program sets up a DO WHILE loop and uses the %SCAN function to pick out the names in the list, which are then assigned to successively numbered macro variables. When the end of the list is reached, the function returns a null value and the program drops out of the loop. (The loop also ends after the 100th variable in the list is read.) Note that the program declares the &Xi and &NVARS variables as GLOBAL. This is because the call to MPARSE is usually made from within another macro, and the %GLOBAL declarations are needed to make the variables available to the macro that makes the call. Also, note that the upper limit on the number of variables is set to 100. This can be adjusted simply by changing the limits of the loops in the macro.

1. REPORT USING PROC PRINTTO

Figure 2 is an example of printed output from PROC LOGISTIC. The output was produced by the following code:

    proc format;
      value yesnofmt 0='N' 1=' Y';
    run;
    proc logistic nosimple order=formatted;
      where race ne 1;        /* exclude whites */
      model hiv=spdball;
      format hiv yesnofmt.;
    run;

[Figure 2: Printed output from PROC LOGISTIC for the model HIV = SPDBALL (Data Set: WORK.RISK; Response Variable: HIV; Response Levels: 2; Number of Observations: 877; Link Function: Logit). The listing shows the Response Profile (ordered value 1 = Y, 2 = N, with counts), a warning that 29 observations were deleted due to missing values, the Criteria for Assessing Model Fit, the Analysis of Maximum Likelihood Estimates (rows INTERCPT and SPDBALL), and the Association of Predicted Probabilities and Observed Responses (Concordant = 27.7%, Discordant = 15.7%, Tied = 56.7%). The remaining numeric values and the highlighting are not recoverable from the source.]

The option NOSIMPLE prevents the printing of simple statistics for each independent variable, thus reducing the length of the print file. The ORDER=FORMATTED option, along with yesnofmt, makes the event value (1=' Y') lower than the non-event value (0='N'): the leading blank in ' Y' sorts before the letter N. This is done because the PROC takes the lower of two values to represent the event, while the Risk Factors study coding scheme uses 1 to represent the event and 0 the non-event.

If the output shown in Figure 2 is directed to a file with PRINTTO, each record in the file will correspond to one line in the printout. Note that three lines contain items that are highlighted. These are the data items required for the report. The object is to design a program that will pick up this information from the print file, and to write the program using the macro language so that it is general purpose.

The macro that generates the logistic report is called BILOGS (for BIvariate LOGistic regressionS), and was developed on a CMS system using Version [number illegible]. The macro requires 5 parameters:

DATA - The name of the data set that the regressions will be performed on.
Y - The name of the dependent variable. This variable must be coded 0/1, with 1 = event.
XLIST - A list of predictor variables, separated from one another by at least one space.
WHERE - An optional condition, conforming to the rules of the WHERE statement, that limits processing to a subset of the data.
TITLE - An optional title to appear at the top of each page of output.

The macro BILOGS follows:

    %macro bilogs(data=,xlist=,y=,where=,title=);
      /* parse xlist to create &x1-&xn, &nvars */
      %mparse(x=&xlist);
      proc format;
        value ynfmt 0='N' 1=' Y';
      run;
      options nocenter linesize=79;
      /* ***** Direct output to a disk file ***** */
      filename out1 'logdat dat' recfm=f lrecl=80;
      proc printto print=out1 new;
      run;
      /* Do regressions with output sent to file */
      %do i=1 %to &nvars;
        proc logistic nosimple data=&data order=formatted;
          %if &where ne %then %str(where &where;);
          model &y = &&x&i;
          format &y ynfmt.;
        run;
      %end;
      proc printto;
      run;   /* redirect print to the default device */
      /* Now read the file of printed output */
      data ddest;
        infile out1;
        retain n_yes 0 n_no 0;
        length firstwrd $ 8;
        /* ***** get first word in line ***** */
        input firstwrd $;
        /* Test to determine if it is "Value". If so, */
        /* get number yes(1) & no(0) responses        */
        if upcase(firstwrd)='VALUE' then do;
          input ordval1 fmtval1 $ n_yes
                ordval0 fmtval0 $ n_no;
        end;
        /* Else determine if it is "INTERCPT". */
        /* If so, get the parameter values     */
        else if firstwrd='INTERCPT' then do;
          input name $ beta se chi pvalue sc;
          oddratio=exp(beta);   /* odds ratio */
          /* Compute lower & upper bounds 95% CI */

          low95=exp(beta-1.96*se);
          up95=exp(beta+1.96*se);
          depvar="&y";
          output;
        end;
        keep beta se chi oddratio low95 up95
             pvalue depvar name n_no n_yes;
      run;
      /* Now use Proc Print to print report */
      options center linesize=79 pageno=1;
      proc print label split='*';
        by depvar;
        var n_no n_yes beta se chi oddratio low95 up95 pvalue;
        id name;
        label name='Predictor*Variable'
              n_no='N for*Event=0'
              n_yes='N for*Event=1'
              beta='Beta'
              se='Std*Err'
              chi='Wald*Chi Square'
              oddratio='Odds*Ratio'
              low95='95% CI*Lower*Bound'
              up95='95% CI*Upper*Bound'
              pvalue='p(chi)'
              depvar='Dependent Variable';
        format oddratio low95 up95 pvalue 5.3 beta se 6.3;
        title "&title";
        title2 'Parameters and Odds Ratio Estimates Using Proc Logistic';
        title3 'Each Variable Was Entered in a Separate Equation';
      run;
    %mend bilogs;

The following command would produce the report shown in Figure 1:

    %bilogs(data=risk, y=hiv, xlist=spdball crack inj4_day,
            where=race ne 1, title=MACRO EXAMPLE);

The macro first invokes %MPARSE to create &X1=spdball, &X2=crack, &X3=inj4_day and &NVARS=3. Next, the macro uses PRINTTO to redirect any printed output that follows to a disk file rather than the default print device. The macro then sets up a loop that runs PROC LOGISTIC once for each X variable, sending three "pages" of output to the file LOGDAT DAT. Once the loop is finished, PRINTTO is invoked again with no PRINT= specification. This redirects any subsequent printed output back to the default device.

The next section of the program picks up information from the print file by reading the first word of every line and checking its contents for one of two "trigger" values. If the first word matches a trigger value, the program reads the statistics from the next line in the file and puts them in the SAS data set being created. In Figure 2 the first pieces of information needed are the number of Y (1) and N (0) responses, which are immediately below the line beginning with the word "Value".
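
The trigger technique can be tried in isolation. The following is a minimal sketch, not part of BILOGS: it reads a few in-stream lines laid out like the Response Profile in Figure 2 and relies on list input moving to the next record when the current one is exhausted (the default FLOWOVER behavior). The data set name COUNTS and the counts 30 and 70 are made up for illustration.

```sas
/* Sketch only: a miniature stand-in for the print file.          */
/* The counts (30 and 70) are invented for illustration.          */
data counts;
  length firstwrd $ 8 fmtval1 fmtval0 $ 1;
  retain n_yes 0 n_no 0;
  input firstwrd $;          /* first word; the line is then released */
  if upcase(firstwrd) = 'VALUE' then do;
    /* Reads "1 Y 30" from the next record, then flows over to    */
    /* the following record for "2 N 70".                         */
    input ordval1 fmtval1 $ n_yes
          ordval0 fmtval0 $ n_no;
    output;
  end;
  keep n_yes n_no;
  datalines;
Response Profile
Value HIV Count
1 Y 30
2 N 70
;
run;

proc print data=counts;
run;
```

Under these assumptions the step should write a single observation with N_YES=30 and N_NO=70. Because the first INPUT statement has no trailing @, each trigger test consumes its line, which is why the statistics are always taken from the records that follow the trigger.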
When the macro finds that the first word of a line is "Value", it reads the number of Yes responses from the next line into the variable N_YES, and reads the number of No responses from the following line into N_NO. Because of the RETAIN statement, these values will be retained through subsequent iterations of the DATA step.

The next trigger value is "INTERCPT". If the first word in the line is not "Value", the program checks for the word "INTERCPT". If it finds it, the macro drops to the next line and picks up the variable name, the beta coefficient, the standard error, the Wald chi-squared, the p-value of the chi-squared, and the standardized coefficient. The macro then calculates the odds ratio and 95% confidence limits, and OUTPUTs an observation to the SAS data set. The program will continue reading, and adding observations to the data set when it encounters trigger values, until it comes to the end of the print file. Then PROC PRINT is invoked to produce the report shown in Figure 1.

2. REPORTS USING OUTEST FILES

Figure 3 shows a sample of an OUTEST data set produced with PROC LOGISTIC. The code used to produce the output is:

    proc format;
      value yesnofmt 0='N' 1=' Y';
    run;
    proc logistic noprint order=formatted outest=estimate covout;
      where race ne 1;
      model hiv=spdball;
      format hiv yesnofmt.;
    run;
    proc print data=estimate;
    run;

In addition to the OUTEST= option that produces the data set, the COVOUT option is used to generate the covariance matrix. As before, the ORDER=FORMATTED option is selected, and NOPRINT is also specified to suppress printed output.

[Figure 3: PROC PRINT listing of the OUTEST data set, with columns OBS, _LINK_, _TYPE_, _NAME_, one column for the intercept and one for SPDBALL. Three observations, all with _LINK_=LOGIT: one with _TYPE_=PARMS (_NAME_=ESTIMATE) and two with _TYPE_=COV (_NAME_=INTERCPT and _NAME_=SPDBALL). The numeric values and the highlighting are not recoverable from the source.]

NESUG '92 Proceedings

The statistics needed to produce the report are highlighted. The first observation contains the beta coefficient of the predictor variable. The third observation contains the diagonal element of the estimated covariance matrix, the square root of which is the standard error of the beta. The Wald chi-squared is equal to the square of the ratio of the beta to its standard error, and the p-value is computed with respect to the chi-squared distribution with one degree of freedom.

The two statistics needed for the report that are not in the OUTEST data set are the number of yes and no responses. These can be computed by using PROC MEANS. The macro BILOGS2 uses an OUTEST data set and a data set produced by PROC MEANS to produce the report in Figure 1. The parameters used in the macro are identical to those in BILOGS. The code for BILOGS2 follows (the part of the code using PROC PRINT is identical to that in BILOGS and is omitted):

    %macro bilogs2(data=,xlist=,y=,where=,title=);
      /* parse xlist to create x1-xn & nvars */
      %mparse(x=&xlist);
      proc format;
        value ynfmt 0='N' 1=' Y';
      run;
      /* *** Do regressions with estimates placed in outest files *** */
      %do i=1 %to &nvars;
        proc logistic nosimple data=&data order=formatted
                      noprint covout outest=dd&i;
          %if &where ne %then %str(where &where;);
          model &y = &&x&i;
          format &y ynfmt.;
        run;
      %end;
      /* **** Now create files containing parameters for each x variable **** */
      %do i = 1 %to &nvars;
        data est&i;
          set dd&i;
          /* fixed length so names are not truncated when est1-estn are combined */
          length name $ 8;
          retain beta;
          /* 1st obs of each est file contains beta */
          if _n_=1 then beta = &&x&i;
          /* 3rd record contains variance of beta; its square root is the std error */
          if _n_=3 then do;
            se=sqrt(&&x&i);
            /* From beta, std err compute other stats */
            chi=(beta/se)**2;
            pvalue=1-probchi(chi,1);
            oddratio=exp(beta);
            low95=exp(beta-1.96*se);
            up95=exp(beta+1.96*se);
            name="&&x&i";
            depvar="&y";
            output;
          end;
          keep beta se chi oddratio low95
               up95 pvalue depvar name;
        run;
      %end;
      /* *** Combine data sets containing stats *** */
      data ddest;
        set %do i = 1 %to &nvars; est&i %end; ;
      run;
      /* **** Now compute number of 0 and 1 answers for dependent var **** */
      proc means data=&data noprint;
        %if &where ne %then %str(where (&where) and (&y=1 or &y=0););
        %else %str(where &y=1 or &y=0;);
        class &y;
        var &xlist;
        output out=yes_no n=;
      run;

      /* create one obs for each variable */
      proc transpose out=y_n;
        where _type_=1;
      run;
      data yes_no;
        set y_n;
        %do i=1 %to &nvars;
          if upcase(_name_)=upcase("&&x&i") then do;
            n_no = col1;
            n_yes = col2;
            output;
          end;
        %end;
        keep n_yes n_no;
      run;
      /* Add info to data set containing stats */
      data ddest;
        merge ddest yes_no;
      run;
      /* Now use Proc Print to print report */
      /* (PROC PRINT step identical to that in BILOGS) */
    %mend bilogs2;

As with BILOGS, the macro call is:

    %bilogs2(data=risk, y=hiv, xlist=spdball crack inj4_day,
             where=race ne 1, title=MACRO EXAMPLE);

BILOGS2 starts with MPARSE. But instead of proceeding to use PRINTTO, the macro does one regression for each of the 3 independent variables and puts the estimates out to separate OUTEST files: DD1, DD2 and DD3. Each file contains 3 observations, one with _TYPE_=PARMS and two with _TYPE_=COV (see Figure 3). A loop then computes statistics for each OUTEST data set, taking the beta from the first observation and the standard error from the third observation, and outputting three data sets, EST1, EST2 and EST3. These are then combined into a single data set, DDEST.

Next, the macro computes the number of HIV=1 and HIV=0 responses for each of the regressions. For this, the program uses PROC MEANS to create a data set with two records, the first containing the number of observations with HIV=0 and the second the number of observations with HIV=1 for each of the 3 independent variables. PROC TRANSPOSE is then used in conjunction with a DATA step to create a data set with one observation for each independent variable, with each observation containing the number of yes and no responses. Finally, this data set is MERGEd with DDEST to produce a data set that has all the relevant information, and can be used as input for PROC PRINT to produce the report.

CONCLUSION

The macros shown here are "bare bones" models of somewhat more elaborate ones that include options to print independent variable labels and specify the width of the confidence intervals. In addition, there are a number of other macros, including ones that summarize 2x2 contingency table output, t-tests and disease incidence risk ratios, that have been developed for the Risk Factors study and are used regularly at NDRI (code for these is available from the author). Regardless of how complex the macros are, however, all build upon the two methods presented in this paper.

In general, the method using PROC PRINTTO has the advantage of being fairly straightforward: one simply "picks up" from the print file the information he or she looks for when going through printouts manually. In addition, a knowledge of exactly how statistics are calculated is not required. On the down side, whenever the format of the printed output from a procedure changes, either from one version to the next or across platforms, the macro must be revised.

The method using OUTEST and OUTPUT files has the advantages of increased flexibility, in that one is not limited to the information on a printout, and portability. On the other hand, the programmer has to have a detailed knowledge of how certain statistics are calculated, and, since information may have to be drawn from different PROCs (as in the example presented here) and may require more file manipulation, more programming sophistication is needed.
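
For the bivariate models used here, the calculations that the OUTEST method requires can be written compactly. The notation below is ours rather than the paper's: \hat\beta is the estimated coefficient of the predictor (first observation of the OUTEST data set) and \widehat{Var}(\hat\beta) is the corresponding diagonal element of the covariance matrix (third observation). The constant 1.96 is the one used in both macros.

```latex
\begin{aligned}
\mathrm{SE}(\hat\beta) &= \sqrt{\widehat{\operatorname{Var}}(\hat\beta)}, \\
\chi^2_{\mathrm{Wald}} &= \left(\frac{\hat\beta}{\mathrm{SE}(\hat\beta)}\right)^{2},
\qquad
p = 1 - F_{\chi^2_1}\!\left(\chi^2_{\mathrm{Wald}}\right), \\
\widehat{\mathrm{OR}} &= e^{\hat\beta},
\qquad
95\%\ \mathrm{CI} = \left[\, e^{\hat\beta - 1.96\,\mathrm{SE}(\hat\beta)},\;
                             e^{\hat\beta + 1.96\,\mathrm{SE}(\hat\beta)} \,\right].
\end{aligned}
```

Here F_{\chi^2_1} denotes the distribution function of the chi-squared distribution with one degree of freedom, which is what PROBCHI(chi,1) returns in the BILOGS2 code.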
