Cut Out The Cut And Paste: SAS Macros For Presenting Statistical Output


Myungshin Oh, UCLA Department of Biostatistics
Mel Widawski, UCLA School of Nursing

ABSTRACT

As statisticians, we often spend more time cutting and pasting from raw SAS output to make a report for clients than we spend writing the programs. This is especially true when we perform the same PROC step for many variables over and over, get hundreds of pages of output, and then need to summarize them somehow. To save time, these macros produce a summarized version of the output containing just the information needed. We have developed macros for the paired t-test, reliability analysis (Cronbach's alpha), the Wilcoxon test, and analysis of variance. The macro for the paired t-test combines information from PROC MEANS, PROC TTEST, and PROC CORR. We often need to present reliability information on a number of scales and sub-scales simultaneously; the reliability macro produces a condensed table. Combining the information from PROC NPAR1WAY and PROC MEANS, our Wilcoxon macro presents a condensed table including Wilcoxon statistics and medians, which clients understand more readily than rank sums. The last of our macros is designed for repeated measures analysis using PROC GLM. It produces a simple table containing the main ANOVA table with descriptive statistics at each time point and for each combination of classes, eliminating much extraneous information.

INTRODUCTION

The job of a statistician consists of preparing data for analysis, determining the appropriate analysis, and programming statistical procedures. Unfortunately, much of the time is actually spent cutting, pasting, and reformatting output to produce summarized but informative reports for clients. SAS output is often far too voluminous to present to the client just as it is produced, and the output of more than one procedure often needs to be combined for a meaningful report.
This requires long and tedious work: cutting the essential information from the raw SAS output, pasting it into a separate editor program, and rearranging it so that the client can grasp the information easily. The complete output, containing all possible information, is often meaningful only to statisticians and distracting to clients. The output often needs to be rearranged to present clearly what has been found. Finally, information from disparate procedures is more meaningful when presented in a single table.

Frequently we perform the same PROC step for many variables over and over and produce hundreds of pages of output; however, we usually need only a few key values for each variable to present to the client. For example, after performing PROC GLM or PROC TTEST for several variables, we go through all the output detail to check all the statistical information. However, if all of the diagnostics look good, all we want to report to the client is the value of the computed statistic, the degrees of freedom, and the p-value from each test. What we are left with is an endless, simple cut and paste job.

To save the time spent cutting, pasting, and editing to present statistical output, we wrote SAS macro programs that produce the SAS output from repeated PROC steps, save it as SAS data sets, and create a condensed table with the information we need, displayed in the format we want. The names of the target data sets, the target variables, and some options to control the procedure(s) are supplied to the macro as macro parameters. The output of the macros is condensed, informative, and, we hope, clear; it is usually presented in the SAS LOG WINDOW.

The four macros are listed here:

1. PTTEST: Paired t-test with correlations.
2. WILCOX: Wilcoxon test (two sample rank-sum test) and medians.
3. GLM: Analysis of variance limited to Type III sums of squares, with means and standard deviations.
4. ALPHA: Reliability analysis (Cronbach's alpha) on a number of scales and sub-scales simultaneously.

Supplemental material, including SAS source code for the macros, sample programs that call the macros, and data for these macros, is available on the web. There will be links to the macros from the staff web page at the UCLA School of Nursing Web site. Since you may find that some of our decisions differ from your preferences, the macros are presented completely commented. This should enable you to modify them for your requirements. We have found that the time needed to produce these macros is approximately equal to the programming, cutting, and pasting involved in 2½ moderately large runs involving these procedures. If you find our choices adequate for your needs, then you have saved even that overhead.
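The general pattern described above (run a PROC once per variable, save the statistics as a SAS data set, and print only the needed values to the log) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of any of the four macros: the macro name %condense, its parameters, and the work data sets _one and _all are hypothetical, and the TRIAL data set is assumed to exist in the WORK library.

```sas
/* Hypothetical sketch: not one of the four macros from the paper. */
%macro condense(data=, varlist=);
  %local i v;

  /* Start from a clean accumulator data set. */
  proc datasets library=work nolist nowarn;
    delete _all _one;
  quit;

  %let i = 1;
  %let v = %scan(&varlist, &i, %str( ));
  %do %while (%length(&v) > 0);

    /* Run the PROC for one variable and save its statistics. */
    proc means data=&data noprint;
      var &v;
      output out=_one(drop=_type_ _freq_) n=n mean=mean std=std;
    run;

    /* Tag the row with the variable name, then append it. */
    data _one;
      length name $32;
      set _one;
      name = "&v";
    run;

    proc append base=_all data=_one;
    run;

    %let i = %eval(&i + 1);
    %let v = %scan(&varlist, &i, %str( ));
  %end;

  /* Write one condensed line per variable to the LOG window. */
  data _null_;
    set _all;
    if _n_ = 1 then put 'Variable' @12 'N' @20 'Mean' @32 'Std';
    put name @12 n 4. @20 mean 8.3 @32 std 8.3;
  run;
%mend condense;

%condense(data=trial, varlist=PreY1 PostY1 FolY1);
```

Writing the table with PUT statements in a DATA _NULL_ step is what sends the result to the log rather than to the output window; a PROC PRINT of the accumulated data set would be a reasonable alternative.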

PAIRED T-TEST

Paired t-tests are used to test for differences between two means when measures are taken at two time points, or when those being measured are matched in some way and independence cannot be assumed. Standard SAS output from PROC TTEST for paired samples contains more information than is usually needed for presentation to the client. Statistics such as the confidence limits of the mean differences and the confidence limits of the standard deviations of the differences are usually not meaningful to the client. In fact, clients prefer seeing the actual means rather than the mean differences. Clients often would also like to see correlations between the measures (or the two time points) in addition to the standard t-test output.

The PTTEST macro produces a single condensed table containing:

1. Basic descriptive statistics for each of the two variables, including means, standard deviations, medians, minimum values, and maximum values
2. T-test statistics and corresponding p-values
3. Pearson correlations between the two variables and corresponding p-values

We have found that this is the information our clients usually request and easily understand.

SAMPLE MACRO CALL AND OUTPUT

This is a sample macro call for the PTTEST macro that uses the TRIAL data set. This data set is available on our web site and also at the SAS site. It is one of the sample data sets for use with SAS examples. In the TRIAL data set, two responses, Y1 and Y2, are each measured three times for each subject (pre-treatment, post-treatment, and at a later follow-up). Thus, the variable PREY1 is the value of measure Y1 pre-treatment. We want to perform four t-tests to see whether there are significant differences between pre and post, and between post and follow-up, for both Y1 and Y2.

We provide values for the parameters required by this macro by defining macro variables. The macro variables LIBR and data provide the input library name and the input data set name, respectively.
These two macro variables are common to the other macros as well. The npair macro variable gives the number of pairs to be compared, 4 here, and listv1 and listv2 hold the first and second variable names of each pair. In our sample input below, the first variable name in listv1, PREY1, and the first variable name in listv2, POSTY1 (post-treatment for Y1), form one pair of variables for a paired t-test. The title macro variable supplies the output title. Then call the PTTEST macro with all of these macro variables.

    %let LIBR   = work;                          /* input library name               */
    %let data   = trial;                         /* input data set name              */
    %let npair  = 4;                             /* number of pairs of variables     */
    %let listv1 = prey1 prey2 posty1 posty2;     /* 1st variable of pair (e.g. pre)  */
    %let listv2 = posty1 posty2 foly1 foly2 ;    /* 2nd variable of pair (e.g. post) */
    %let title  = (Pre vs. Post) & (Post vs. Follow-up);
    %pttest (&LIBR, &data, &npair, &listv1, &listv2, &title);
    RUN;

The output produced by this macro appears in the LOG WINDOW in the MS Windows version of SAS. It is reproduced in the box below. As you can see, information from a number of statistical procedures is combined in a single, easily accessible table. The mean, median, standard deviation, minimum, and maximum are obtained from PROC MEANS. The t-test is obtained from PROC TTEST, and the correlations from PROC CORR. Any title you choose will be used for the table. The output for each of the 4 pairs is separated by a dotted line.

    ****** PAIRED T TESTS : (Pre vs. Post) & (Post vs. Follow-up)

                                                             TTest     Corr
    Variable  Label      N  Mean  Med  STD  Min  Max  Df   (ProbT)  (ProbF)
    PreY1     Pre Y1
    PostY1    Post Y1                                      (0.0003) (0.8379)
    ........
    PreY2     Pre Y2
    PostY2    Post Y2                                      (0.0797) (0.1850)
    ........
    PostY1    Post Y1
    FolY1     FollUp Y1                                    (0.2435) (0.1686)
    ........
    PostY2    Post Y2
    FolY2     FollUp Y2                                    (0.0005) (0.9879)

This example was chosen because it was available, not because it is a good example of the use of the procedure. But it does afford an opportunity to explore each statistic presented. The medians are approximately equal to the means, so we can tell that the distributions are at least symmetrical. The standard deviations are for the most part equivalent, though the Y2 follow-up may be a problem. Where there appears to be a significant difference, pre versus post for Y1, the difference is also reflected in the medians and in the minimum and maximum values. This precludes worry about a ceiling effect. However, the lack of relationship between the pre and post measures for Y1 implies that people are responding to the treatment differently. On the other hand, even though there is only a trend toward a decrease for pre-post Y2, the mild correlation indicates that there is some tendency toward equal change in Y2 for different subjects.

It might be useful to contrast the table above with the output you would obtain from running each of the procedures separately. This output is presented below.

    The MEANS Procedure

    Variable  Label      N  Mean  Median  Std Dev  Minimum  Maximum
    PreY1     Pre Y1
    PreY2     Pre Y2
    PostY1    Post Y1
    PostY2    Post Y2
    FolY1     FollUp Y1
    FolY2     FollUp Y2

    (page break here)

    The TTEST Procedure

    Statistics
                         Lower CL        Upper CL  Lower CL           Upper CL
    Difference        N  Mean      Mean  Mean      Std Dev   Std Dev  Std Dev   Std Err  Minimum  Maximum
    PreY1 - PostY1
    PreY2 - PostY2
    PostY1 - FolY1
    PostY2 - FolY2

    T-Tests
    Difference        DF  t Value  Pr > |t|
    PreY1 - PostY1
    PreY2 - PostY2
    PostY1 - FolY1
    PostY2 - FolY2

    (page break here)

    The CORR Procedure

    4 With Variables:  PostY1  PostY2  FolY1   FolY2
    4      Variables:  PreY1   PreY2   PostY1  PostY2

    Simple Statistics
    Variable  N  Mean  Std Dev  Sum  Minimum  Maximum  Label
    PostY1                                             Post Y1
    PostY2                                             Post Y2
    FolY1                                              FollUp Y1
    FolY2                                              FollUp Y2
    PreY1                                              Pre Y1
    PreY2                                              Pre Y2

    Pearson Correlation Coefficients, N = 18
    Prob > |r| under H0: Rho=0
                        PreY1  PreY2  PostY1  PostY2
    PostY1  Post Y1
    PostY2  Post Y2
    FolY1   FollUp Y1
    FolY2   FollUp Y2
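The three separate procedure runs contrasted with the macro above can also be captured as data sets instead of printed pages, which is the raw material a condensed table needs. The following is a minimal sketch for a single pair, not the PTTEST macro itself: the data set names stats, tt, and pc are hypothetical, the TRIAL data set is assumed to be in the WORK library, and the ODS table names used (Summary, TTests, PearsonCorr) are the ones these procedures produce in current SAS releases.

```sas
/* Hypothetical sketch: one pair (PreY1, PostY1), not the PTTEST macro. */

/* 1. Descriptive statistics for both members of the pair. */
ods output Summary=stats;
proc means data=trial n mean median std min max;
  var PreY1 PostY1;
run;

/* 2. Paired t-test; keep only the table of t statistics. */
ods output TTests=tt;
proc ttest data=trial;
  paired PreY1*PostY1;
run;

/* 3. Pearson correlation between the two measures. */
ods output PearsonCorr=pc;
proc corr data=trial pearson;
  var PreY1;
  with PostY1;
run;

/* Inspect the captured pieces before merging them into one table. */
proc print data=stats; run;
proc print data=tt;    run;
proc print data=pc;    run;
```

The column names inside these data sets vary somewhat between SAS releases, so it is worth printing them once, as above, before writing the DATA step that combines them.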

Notice that the output of the separate procedures is much more verbose, and its elements are not easily assimilated together. In addition, the program is longer than the simple macro call. The statistics of interest, such as the correlations, are often buried in a larger table. As more comparisons are added, the hunt, cut, and paste time increases.

WILCOXON TWO SAMPLE RANK SUM TEST

The Wilcoxon rank sum test is used when the criterion variable is not normalizable and two groups are to be compared. The WILCOX macro presents a condensed table for the Wilcoxon two sample rank-sum test, including the Wilcoxon statistics. Medians are presented instead of the rank sums found in standard NPAR1WAY output, as they tend to be more accessible to our clients. While it is possible to have a significant Wilcoxon test when the medians are the same, this tends to raise doubts about the adequacy of the test for the purpose.

The condensed output the macro produces contains the medians and N's of the target variables for each of the two groups, combined with Wilcoxon's rank-sum test statistics and their corresponding p-values. In addition, p-values less than or equal to 0.05 are flagged with *******, and p-values less than or equal to 0.1 are flagged with *.

SAMPLE MACRO CALL AND OUTPUT

Let us assume that we want to compare the effects of treatment A with those of treatment B separately at each time period for both variables Y1 and Y2. In addition, suppose that the variables of interest are completely ill conditioned for parametric analysis. In that case we may find Wilcoxon two sample rank-sum tests more appropriate. Here we are using the TRIAL data set even though there is no evidence that parametric statistics are inappropriate for it.

The where macro variable supplies a WHERE statement, similar to what you would use in PROC NPAR1WAY, that restricts the target data set. Since this sample data set has more than 2 groups in the TRT variable, this macro variable is very useful for restricting the analysis to the 2 target groups. The class macro variable supplies the name of the group variable, and the var macro variable contains the name(s) of the target variable(s) to be analyzed.

    %let LIBR  = work ;                              /* input library name          */
    %let data  = trial;                              /* input data set name         */
    %let where = where (trt="a" or trt="b") ;        /* WHERE statement in NPAR1WAY */
    %let class = trt ;                               /* class (group) variable      */
    %let var   = PreY1 PostY1 FolY1 DiffY1 PreY2 PostY2 FolY2 DiffY2;
    %let title = Between Treatments (A vs. B);
    %wilcox(&libr, &data, &where, &class, &var, &title);
    RUN;

The WILCOX macro output is presented below. For each target variable, the medians and the numbers of observations for each of the two groups, and the test statistics comparing the two groups, are aligned in a row. Follow-up of Y2 (FOLY2) and the difference between pre-treatment and follow-up of Y2 (DIFFY2) are significantly different (p-value <= 0.05) between treatments A and B. These significant comparisons are flagged in the output for quick scanning. The test of the difference is analogous to the test of an interaction between time and group in an analysis of variance. We present one-sided tests because we always know the direction; the macro can easily be modified to perform two-sided tests.

    WILCOXON - Between Treatments (A vs. B)
    Summary Stats are Medians
    Values in Parentheses () are N's
    Tests are Approximate t from Wilcoxon Rank Sums

    NAME      A    B    Approx t   p>=t
    PreY1
    PostY1
    FolY1
    DiffY1
    PreY2
    PostY2
    FolY2                                 ********
    DiffY2                                ********

Note that, unlike parametric statistics, where the mean of the differences equals the difference of the means when the N is constant, the median of the differences does not necessarily equal the difference of the medians.

Compare this with the output of the standard procedure, presented below for just two of the eight variables in our table. Note also that the following output does not contain medians.

    Wilcoxon - Between Treatments (A vs. B)                              1

    The NPAR1WAY Procedure

    Wilcoxon Scores (Rank Sums) for Variable PreY1
    Classified by Variable Trt

                 Sum of   Expected    Std Dev    Mean
    Trt    N     Scores   Under H0    Under H0   Score
    A
    B
    Average scores were used for ties.

    Wilcoxon Two-Sample Test
    Statistic
    Normal Approximation
      Z
      One-Sided Pr < Z
      Two-Sided Pr > |Z|
    t Approximation
      One-Sided Pr < Z
      Two-Sided Pr > |Z|
    Z includes a continuity correction of 0.5.

    Kruskal-Wallis Test
    Chi-Square
    DF                 1
    Pr > Chi-Square

    (page break here)

    Wilcoxon - Between Treatments (A vs. B)                              2

    The NPAR1WAY Procedure

    Wilcoxon Scores (Rank Sums) for Variable PostY1
    Classified by Variable Trt

                 Sum of   Expected    Std Dev    Mean
    Trt    N     Scores   Under H0    Under H0   Score
    A
    B
    Average scores were used for ties.

    Wilcoxon Two-Sample Test
    Statistic
    Normal Approximation
      Z
      One-Sided Pr < Z
      Two-Sided Pr > |Z|
    t Approximation
      One-Sided Pr < Z
      Two-Sided Pr > |Z|
    Z includes a continuity correction of 0.5.

    Kruskal-Wallis Test
    Chi-Square
    DF                 1
    Pr > Chi-Square

    (page break here)
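The ingredients of the condensed Wilcoxon table (the test itself, the group medians, and the significance flags) can be sketched as below. This is an illustration under stated assumptions rather than the WILCOX macro: the data set names wt and med and the helper macro %flag_p with its parameters are hypothetical, the WHERE clause copies the sample call, and the TRIAL data set is assumed to be in the WORK library.

```sas
/* Hypothetical sketch for one variable, not the WILCOX macro. */

/* 1. Rank-sum test; save the test table as a data set. */
ods output WilcoxonTest=wt;
proc npar1way data=trial wilcoxon;
  where trt = "a" or trt = "b";
  class trt;
  var FolY2;
run;

/* 2. Median and N for each of the two groups. */
proc means data=trial noprint nway;
  where trt = "a" or trt = "b";
  class trt;
  var FolY2;
  output out=med(drop=_type_ _freq_) n=n median=median;
run;

/* 3. Flagging rule from the text, as a reusable helper.      */
/*    in=   data set holding one p-value per row (assumed)    */
/*    pvar= name of the numeric p-value variable (assumed)    */
%macro flag_p(in=, out=, pvar=);
  data &out;
    set &in;
    length flag $8;
    /* Missing compares low in SAS, so test for it first. */
    if missing(&pvar)     then flag = ' ';
    else if &pvar <= 0.05 then flag = '*******';
    else if &pvar <= 0.10 then flag = '*';
    else                       flag = ' ';
  run;
%mend flag_p;

proc print data=wt;  run;
proc print data=med; run;
```

The layout of the WilcoxonTest data set differs between SAS releases, so the step that extracts the p-value to pass to %flag_p is deliberately left out here.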

ANALYSIS OF VARIANCE

The GLM macro is designed to produce more condensed output from PROC GLM. It will handle repeated measures analysis or independent groups analysis at a single time period. It produces a main ANOVA table for Type III sums of squares, and means and standard deviations at each level of the class variable and at each time point. All of the output comes from PROC GLM. Incomplete models may be specified. This macro is most useful when the same model will be used for a number of different dependent measures (e.g. Y1 and Y2). The repeated measures are provided, and the time periods represented, as separate variables in the data set. For example, PREY1 is the measurement of Y1 before treatment; POSTY1 is the measurement of Y1 after treatment.

SAMPLE MACRO CALL AND OUTPUT

This sample macro call performs a repeated measures analysis of variance. Time is the repeated factor with 3 levels (Pre, Post, Follow-up), and treatment is the between-groups factor, also with 3 levels (A, B, and Control). The data come from the same TRIAL data set we have seen earlier. Main effects and interactions will be included in the model.

The nclass macro variable provides the number of class (independent) variables (factors), and the class macro variable provides the name(s) of the grouping variables (factors) to be used in the GLM model. They are listed with a space between them, as you would list them in a CLASS statement in PROC GLM. The model macro variable provides the model specification, as you would write it in a MODEL statement in GLM. If the model specification is null, the full model with all class variables and interactions will be used. The means macro variable controls the output of variable means; it is specified as you would specify a MEANS statement. To request means broken down by effects, those effects must be included in the model macro variable as well.

The remaining three macro variables provide information on how to interpret the dependent variable list. The ndep macro variable indicates the number of dependent measures (variable sets analyzed in separate analyses), the time macro variable indicates the number of time points (one variable per time point), and deplist lists all of the variable names from the data set needed for the analysis. Sets of variables for each measure are separated by a | (vertical bar). There should be as many variables in each set of variable names in deplist as there are time periods. In the example below we are using two different measures, Y1 and Y2, and measuring each at three time periods.

    %let LIBR    = work ;    /* input library name             */
    %let data    = trial;    /* input data set name            */
    %let nclass  = 1 ;       /* number of class(indep.) var.   */
    %let class   = trt;      /* names of class(indep.) var.(s) */
    %let model   =;          /* model specification, leave     */
                             /* null for the full model        */
    %let means   = trt;      /* MEANS statement in GLM         */
    %let ndep    = 2 ;       /* # of dep. Var. sets (measures) */
    %let time    = 3 ;       /* # of time periods(variables)   */
    %let deplist = PreY1 PostY1 FolY1 | PreY2 PostY2 FolY2;
    %glm(&libr, &data, &nclass, &class, &model, &means, &ndep, &time, &deplist) ;
    RUN;

As the sample macro call indicates, two separate PROC GLM steps will be run using the same GLM model. We provide PROC GLM with two dependent variable sets of three variables each (for the 3 time periods).

Notice that for Y1 there is a significant time effect, and we can see the trend by checking the means at each time point and in each group. That is, each group's means for Y1 increase across time. Although the increase for the control group seems slightly slower, reaching its peak at follow-up rather than immediately after treatment as groups A and B did, no significant interaction was detected.

For Y2, the time main effect and the time by treatment interaction effect are significant. Generally, if an interaction effect is significant, it drives the discussion. When the interaction is significant you must take treatment into account when discussing time, and you must consider time when discussing treatment. Scanning the means for the measure Y2 reveals that the source of the interaction is the increase over time for CONTROL and treatment B, as compared with the relatively stable treatment A.
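The way a bar-delimited deplist maps onto repeated PROC GLM runs, as described above, can be sketched like this. It is a simplified illustration with one class variable and the full model, not the authors' GLM macro: the macro name %glm_sketch and its keyword parameters are hypothetical, and the TRIAL data set is assumed to be in the WORK library.

```sas
/* Hypothetical sketch: splits deplist on | and runs one     */
/* repeated measures PROC GLM per set of dependent variables. */
%macro glm_sketch(data=, class=, time=, deplist=);
  %local i set;

  %let i = 1;
  %let set = %scan(&deplist, &i, %str(|));
  %do %while (%length(&set) > 0);

    proc glm data=&data;
      class &class;
      /* One dependent variable per time point; NOUNI skips */
      /* the separate univariate analysis of each variable. */
      model &set = &class / nouni;
      repeated time &time;
      means &class;
    run;
    quit;

    %let i = %eval(&i + 1);
    %let set = %scan(&deplist, &i, %str(|));
  %end;
%mend glm_sketch;

%glm_sketch(data=trial, class=trt, time=3,
            deplist=PreY1 PostY1 FolY1 | PreY2 PostY2 FolY2);
```

This sketch only reproduces the standard PROC GLM output for each set; condensing it into a single table would additionally require capturing the ANOVA and means tables as data sets.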

    ******* Summary of GLM (Repeated : 3 time-point): Model = TRT

    ########### Dependent Variables : Time1=PREY1 Time2=POSTY1 Time3=FOLY1
                                      PRE Y1      POST Y1      FOLLUP Y1

    SOURCE                 DF   SumSquare   MeanSquare   F Value   Pr > F
    BetweenS Trt
    BetweenS Error
    WithinSu time                                                  <.0001
    WithinSu time*trt
    WithinSu Error(time)

    Means/Stdev in each Level of Class variables --
    TRT       N   Mean_T1   Sd_T1   Mean_T2   Sd_T2   Mean_T3   Sd_T3
    A
    B
    Control

    ########### Dependent Variables : Time1=PREY2 Time2=POSTY2 Time3=FOLY2
                                      PRE Y2      POST Y2      FOLLUP Y2

    SOURCE                 DF   SumSquare   MeanSquare   F Value   Pr > F
    BetweenS Trt
    BetweenS Error
    WithinSu time                                                  <.0001
    WithinSu time*trt                                              <.0001
    WithinSu Error(time)

    Means/Stdev in each Level of Class variables --
    TRT       N   Mean_T1   Sd_T1   Mean_T2   Sd_T2   Mean_T3   Sd_T3
    A
    B
    Control

In retrospect, we could have condensed the output further into a single table summarizing the results, followed by tables of means. In such a table the error variances could be eliminated and the output restricted to the degrees of freedom, the F values, and the p-values. We took the approach we did because most of our clients would like to see the means along with the presentation of each analysis. If we develop an additional method of presenting the results, the macro will be referenced at the same web site.

RELIABILITY ANALYSIS

Reliability analysis is commonly used to assess the internal consistency of several variables that measure a single underlying factor or latent structure. One of the most common measures of the internal consistency of a scale is Cronbach's alpha. Very often the data contain several sub-scales, and clients would like information on the reliability of each sub-scale as well as of the scale as a whole. In scale construction especially, it is often useful to present information on alpha with each item deleted. This is useful for identifying items that do not seem to fit with the overall scale, as well as items that may need to be reversed.
All of this information and more is presented when the ALPHA option is requested on PROC CORR; however, the presentation of this information by PROC CORR separates the basic information on scale and sub-scale reliability. The ALPHA macro presents a summary table of Cronbach's alphas for scales and sub-scales at the beginning of the output. If the option to present the alphas with each item deleted ("alpha with deleted variable") is chosen, that information is presented for each scale or sub-scale in turn, following the summary table. Finally, complete correlation matrices are also available on request, but these are presented in the OUTPUT WINDOW instead of the LOG WINDOW.

SAMPLE MACRO CALL AND OUTPUT

The PSYCHDAT data set has 6 items assessing the mental stability of psychiatric patients. Assume that there are two different scales: SCALE1 contains ANXIETY, DEPRESS, and SLEEP, and SCALE2 contains SEX, LIFE, and WGHCHG (Weight Change). The sample macro call below requests reliability information for each of these two scales and for the total scale.

The ncorr macro variable specifies the number of scales, which is 3 in this example. The nv, var, and scale macro variables are the number of variables in each scale, the variable names, and a label for each scale, respectively. Thus, in the example below the first scale contains 3 variables; those variables are ANXIETY, DEPRESS, and SLEEP; and the scale is labeled SCALE 1. Information for each scale should be separated by | and be arranged in the same order on the %LET statements for each macro variable. The dvar macro variable controls the production of tables for alpha with deleted variable. Code dvar = 1 on the %LET statement to produce alpha with deleted variable in the LOG WINDOW. If you want to see the original PROC CORR output in the OUTPUT WINDOW, set the prtout macro variable equal to 1.

    %let LIBR   = work;        /* input library name                     */
    %let data   = psychdat;    /* input data set name                    */
    %let ncorr  = 3;           /* number of scales                       */
    %let nv     = 3 | 3 | 6;   /* number of variables in scale           */
    %let var    = Anxiety Depress Sleep | Sex Life WghChg | Anxiety--WghChg;
    %let scale  = scale 1 | scale 2 | Total scale;
    %let dvar   = 1;           /* 1 requests Alpha with deleted variable */
    %let prtout = ;            /* 1 requests PROC CORR output in WINDOW  */
    %alpha (&LIBR, &data, &ncorr, &var, &scale, &nv, &dvar, &prtout);
    RUN;

Notice that we are not requesting the complete printout in the example above, but we are requesting the alpha with deleted variable, presented below the summary table.

Consider the output produced by the ALPHA macro as specified in our sample call. The first section presents the raw and standardized alpha and the variable names for each scale in a single table. This is the default output of the macro. Since we specified dvar = 1 above, the alpha with deleted variable section for each scale follows this summary table.

Notice that the TOTAL SCALE, which includes all 6 variables, has a higher standardized alpha than either of the sub-scales. When we examine the alpha with deleted variable section for the TOTAL SCALE, we can see that the standardized alpha increases when the variable SEX is removed from the scale. Thus, removing that variable would make the TOTAL SCALE more internally consistent. We use the standardized alpha for this because the variables used in this example are not scaled consistently.

    ****** Cronbachs ALPHA ********

                                   ---- A L P H A ----
    SCALE         # of var.   N    RAW    STANDARDIZED
    scale 1
    scale 2
    Total scale

    Variables in Scale
    scale 1       Anxiety Depress Sleep
    scale 2       Sex Life WghChg
    Total scale   Anxiety--WghChg

    ******* Cronbach Alpha With Deleted Variable *******

    scale 1 - Raw Alpha:        Std Alpha:
                     Raw Variables            Standardized Variables
    Deleted var   Corr.W/Total  Alpha Del   Corr.W/Total  Alpha Del
    Anxiety
    Depress
    Sleep

    scale 2 - Raw Alpha:        Std Alpha:
                     Raw Variables            Standardized Variables
    Deleted var   Corr.W/Total  Alpha Del   Corr.W/Total  Alpha Del
    Sex
    Life
    WghChg

    Total scale - Raw Alpha:    Std Alpha:
                     Raw Variables            Standardized Variables
    Deleted var   Corr.W/Total  Alpha Del   Corr.W/Total  Alpha Del
    Anxiety
    Depress
    Sleep
    Sex
    Life
    WghChg
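The two kinds of information shown in the output above (overall alpha and alpha with each variable deleted) come from PROC CORR with the ALPHA option, and they can be saved as data sets rather than read off the listing. The following is a minimal sketch for one scale, not the ALPHA macro: the data set names alpha1 and alphadel1 are hypothetical, the PSYCHDAT data set is assumed to be in the WORK library, and the ODS table names CronbachAlpha and CronbachAlphaDel are the ones PROC CORR produces in current SAS releases.

```sas
/* Hypothetical sketch for scale 1 only, not the ALPHA macro. */
ods output CronbachAlpha=alpha1
           CronbachAlphaDel=alphadel1;

proc corr data=psychdat alpha nocorr nomiss;
  /* NOCORR suppresses the correlation matrix;        */
  /* NOMISS drops observations with any missing item. */
  var Anxiety Depress Sleep;
run;

/* Raw and standardized alpha for the scale. */
proc print data=alpha1 noobs;
run;

/* Alpha and correlation with total when each item is deleted. */
proc print data=alphadel1 noobs;
run;
```

Repeating this step once per scale and stacking the alpha1 data sets would give a summary table of the kind shown at the top of the macro output.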

CONCLUSION

We hope you will find these macros useful in saving time spent on cutting, pasting, and reformatting. Feel free to use the macros as they are or to modify them as needed. If you distribute the macros to others, please leave the attributions to us intact in the macros. If you modify them, add your attribution to ours.

We have presented four macros above for modifying SAS output for presentation to clients. These macros are:

1. PTTEST: Paired t-test with correlations.
2. WILCOX: Wilcoxon test (two sample rank-sum test) and medians.
3. GLM: Analysis of variance limited to Type III sums of squares, with means and standard deviations.
4. ALPHA: Reliability analysis (Cronbach's alpha) on a number of scales and sub-scales simultaneously.

One caution should be noted: if you use these macros but your variable naming conventions differ markedly from ours (e.g. very long variable names), or the formats used are ones we have not anticipated, then the macros may not work correctly. Always examine the complete output for error messages as well.

For those interested in the internals of the macros, we make use of these features of SAS:

1. Macro variables and macro statements to drive the program.
2. PROC CONTENTS to capture variable names, labels, and formats.
3. OUTPUT statements in procedures to produce SAS data sets containing the statistics of interest.
4. ODS specifications in procedures to produce additional SAS data sets containing statistics.
5. Data management procedures such as PROC SORT and PROC TRANSPOSE to manipulate files and make them easier to use.
6. DATA steps for merging, reformatting, and producing tables.

In addition to the macros presented here, some additional macros are available on our web site, including the following:

1. XMEANS presents all of the groups for each variable together, rather than all variables grouped together for each group.
2. UNIFORM presents descriptive statistics along with information on normality and skew.
3. CHI presents summaries of chi-square statistics for a number of tables in a single condensed table.

In addition, a number of general-purpose file and data management macros will be made available at this site.

ACKNOWLEDGMENTS

We would like to acknowledge our clients, especially Drs. Donna Vredevoe and Deborah Koniak-Griffin, whose requirements drove us to develop these macros.

CONTACT INFORMATION

MyungShin Oh
UCLA
adumas@hanmail.net

Mel Widawski
UCLA
mel@ucla.edu
