Using MACRO and SAS/GRAPH to Efficiently Assess Distributions
Paul Walker, Capital One

INTRODUCTION

A common task in data analysis is assessing the distribution of variables by means of univariate statistics and graphs such as histograms, scatter plots, and box plots. This task may be relatively straightforward when analyzing a small data set with only a handful of variables. However, when the data set is large and contains hundreds or even thousands of variables, the task can be daunting.

The method described in this paper was motivated by the need to efficiently examine the distributions of a large number of variables. The method is characterized by the following:

- It is an improvement over proc univariate insofar as the output is more targeted to what you want to see.
- It provides an automated way to create graphs for a large number of variables.
- It provides a way to include summary statistics on your graphs.
- It integrates multiple graphs into one .gif file.

SOLUTION TO THE PROBLEM

A method was developed to automatically create the style of graph shown in Figure 1 for each variable specified by the analyst. For each variable, the graph is contained in a .gif file that is usually named after the variable being graphed. The analyst specifies the variables in the form of a single-space-delimited list.

Within a Windows operating system environment, one can quickly wade through large numbers of graphics files using the Thumbnails or Filmstrip viewing options in Windows Explorer. I find this much more convenient than putting the graphs into a Word or PDF document, which makes searching for an individual variable difficult. Moreover, having individual .gif files is useful if you want to organize the variables into different categories. In predictive model building, you might make folders for variables you want to discard and variables you want to include in the model.
Figure 1: Histogram / Scatter Plot Pair

There are essentially four steps in the macro that creates the graph shown in Figure 1:

1. Parse the user-specified variable list into macro variables indexed by an integer.
2. Get summary statistics from other procs into macro variables.
3. Create multiple graphs, putting summary statistics in the footnote section of each graph.
4. Combine the graphs into one .gif file using proc greplay.

I will describe the general idea of the code in each of these four steps. The reader should be familiar with SAS/GRAPH and the MACRO language.

STEP 1: PARSE THE VARIABLE LIST

The analyst must specify the list of variables that he/she wants to examine. For example, suppose the analyst has the following variable list:

    age height weight fastgluc postgluc
We need to assign these variable names to macro variables indexed by an integer. Thus, we would want the following macro variables and corresponding values:

    Macro Variable    Value
    Var1              age
    Var2              height
    Var3              weight
    Var4              fastgluc
    Var5              postgluc
    nvars             5

The code in Step 1 of the Appendix performs this parsing. The variable list may contain any number of variable names. Now that we have the variable names in macro variables indexed by an integer, we can iterate through each variable name via a macro do loop, e.g. %do i=1 %to &nvars; and refer to the i-th variable with the double ampersand technique, &&var&i. For example, when i=3, &&var&i resolves to weight.

STEP 2: GET SUMMARY STATS

The purpose of the second step is to obtain those key summary statistics that you really want to see, and put them into macro variables. Once they are stored in macro variables, you can write them into the footnotes of the graphs produced in Step 3. I will illustrate this trick using proc means. First, I use proc means to create a temporary output data set named means.

    proc means data=&lib..&ds;
       var &&var&i;
       output out=work.means median(&&var&i)=median;

This temporary data set has one observation. It contains the automatic variables _TYPE_ and _FREQ_, and the user-created variable median. To put the value of the median into a macro variable with the same name, we use the call symput technique.

    data _null_;
       set means;
       call symput('median', trim(left(median)));

Note the three distinct uses of the word median in the above blocks of code. One is the statistic keyword in the output statement of proc means, another is an ordinary data set variable name, and the third is a macro variable name. In short, the preceding code puts the value of the median of the i-th variable into the macro variable median, which can then be referenced via &median. In practice, you would usually want to be more descriptive than just the median.
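Steps 1 and 2 can be tried in isolation before looking at the full macro. The following is a minimal sketch, not the Appendix code: it assumes SAS 9 or later, swaps in %sysfunc(countw()) and call symputx in place of the %do %until loop and call symput used in this paper, and uses the sashelp.class sample data set as a stand-in for the analyst's data. The macro name demo_stats and its default arguments are hypothetical.

```sas
/* Hypothetical demo macro; sashelp.class stands in for the analyst's data. */
%macro demo_stats(lib=sashelp, ds=class, ivlist=age height weight);
   %local i nvars median;

   /* Step 1: count the words, then store each one in var1, var2, ... */
   %let nvars = %sysfunc(countw(&ivlist, %str( )));
   %do i = 1 %to &nvars;
      %local var&i;
      %let var&i = %scan(&ivlist, &i, %str( ));
   %end;

   /* Step 2: for each variable, put its median into the macro variable median */
   %do i = 1 %to &nvars;
      proc means data=&lib..&ds noprint;
         var &&var&i;
         output out=work.means median(&&var&i)=median;
      run;

      data _null_;
         set work.means;
         /* call symputx strips leading and trailing blanks on its own */
         call symputx('median', median);
      run;

      /* The explicit run; above ensures &median is set before it is used */
      %put NOTE: variable &&var&i has median &median;
   %end;
%mend demo_stats;

%demo_stats()
```

Each pass through the second loop writes one line to the log. The explicit run; statements matter in this sketch because the %put would otherwise be processed before the data step that sets median has executed.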
STEP 3: CREATE GRAPHS

The reader should already be familiar with the syntax of proc gchart and proc gplot. What I will illustrate is how to write summary statistics into a graph's footnote. To do so, you need to use a footnote statement with the macro variable reference inside double quotation marks, because macro variables do not resolve inside single quotation marks. For example, a histogram could be produced as follows:

    proc gchart data=&lib..&ds;
       vbar &&var&i;
       footnote "the median is &median";

You can create multiple graphs in this fashion (for example, histograms and scatter plots for the i-th variable), and then in Step 4 we put them together using proc greplay.

STEP 4: USE PROC GREPLAY

Before getting to the actual greplay proc, there are some options and pieces of code you need to specify. Since we want to output each set of graphs in a single .gif file, you will need to specify the filename as well as the device driver. The following code achieves this:

    goptions device=gif gsfname=nesug;
    filename nesug "C:\Graphs\&&var&i...gif";

Here the file is named with the variable's name. Alternatively, you could precede the variable name with some descriptive phrase, such as graphlist_. The triple period in the filename statement is necessary to resolve the double ampersand reference &&var&i: one period is consumed as a macro variable delimiter on each of the two resolution passes, and the third remains as the literal period before gif.

Recall that we only want one .gif file created for every variable. To achieve this, you must do two things. First, you must store each individual graph (histogram, scatter plot, etc.) that you produce for a given variable in a temporary SAS catalog. Second, you must ensure that these individual graphs are not written to .gif files, i.e. that we do not create any extraneous .gif files. The following code achieves both of these:

    goptions nodisplay;
    proc gchart data=&lib..&ds gout=work.gseg;
       vbar &&var&i / name="histo";

Essentially, this code turns off the display, which prevents .gif output files from being produced. Later on, immediately before we use proc greplay, we will turn the display back on.
The options gout= and name= create a catalog entry named histo in the temporary work.gseg catalog. Using SAS Explorer, you can view the contents of the catalog (see Figure 2). I have also created an entry named scatter in the catalog.
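The Step 4 preliminaries described above can be checked on their own. The following is a minimal sketch under stated assumptions, not the author's macro: it sets the macro variables i and var3 by hand, uses the sashelp.class sample data set in place of the analyst's data, and lists the catalog with the contents statement of proc catalog as a programmatic alternative to browsing in SAS Explorer.

```sas
/* Part 1: how the triple period resolves (values set by hand for illustration) */
%let i = 3;
%let var3 = weight;
%put C:\Graphs\&&var&i...gif;   /* writes C:\Graphs\weight.gif to the log */
%put C:\Graphs\&&var&i..gif;    /* writes C:\Graphs\weightgif: one period too few */

/* Part 2: store two graphs in work.gseg without displaying or exporting them */
goptions reset=all nodisplay;

proc gchart data=sashelp.class gout=work.gseg;
   vbar height / name="histo";
run;
quit;

proc gplot data=sashelp.class gout=work.gseg;
   plot weight * height / name="scatter";
run;
quit;

/* Part 3: list the entries now held in the catalog */
proc catalog catalog=work.gseg;
   contents;
run;
quit;

/* Restore the display so later graphics output is written as usual */
goptions display;
```

If this is run twice in the same session, the listing should show additional entries with a trailing integer appended to the names, which is the naming behavior that the catalog cleanup in the macro is meant to prevent.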
Figure 2: Temporary Catalog with Graphics Entries

We will now use proc greplay to put the two catalog entries together in a .gif file. You must first turn the display back on, so that the .gif file will be produced.

    goptions display;

You must also specify the template catalog which contains the template you want to use, as well as the particular template from it. It is possible to create custom templates, but I usually use one of the default templates provided in the sashelp.templt catalog, which can be browsed via SAS Explorer, as in Figure 3.

Figure 3: Browsing the sashelp.templt Catalog

The following code puts the catalog entries named histo and scatter together using the V2 template.

    proc greplay nofs igout=work.gseg;
       tc sashelp.templt;
       template v2;
       treplay 1:histo 2:scatter;

The igout= option specifies which catalog you are pulling entries from. The treplay statement is where you specify the catalog entries you want to put together. Since the template V2 has slots for two graphs, you must specify which entry goes in position 1 and which in position 2.

The code I have described above works fine if you only graph one variable. For multiple variables, you have to add one extra block of code, and here's why. If you try to create a catalog entry named histo when an entry with that name already exists, then SAS will automatically name your catalog entry histo1, and it will keep increasing the trailing integer for each additional entry you try to create named histo. Thus, the macro must include code to delete the catalog entries created for the previous variable in the iteration. The code should be placed before any of the graphics statements (proc gchart, proc gplot, etc.). The following code deletes the catalog entries named histo and scatter from the temporary work.gseg catalog.

    proc catalog catalog=work.gseg;
       save histo.grseg scatter.grseg;
       delete histo.grseg scatter.grseg;

The save statement makes sure that only the histo and scatter entries exist in the catalog, thus eliminating any extraneous catalog entries. Then the delete statement deletes them. You may see an error message in the log on the first iteration, because the catalog entries histo and scatter do not yet exist.

PUTTING IT ALL TOGETHER

In the Appendix I have provided the complete code necessary to create the graph in Figure 1. This graph is based on the sasuser.diabetes data set. The macro call used to create this (and other graphs not shown) was:

    %let mylist = age height weight fastgluc postgluc;
    %graphlist (
       lib=sasuser,
       ds=diabetes,
       ivlist=&mylist,
       dv=pulse,
       path=c:\graphs);

Notice that I used a macro variable named mylist to contain the variable list, which I then referenced in the macro call. If your list is very long, you might even want to include your list of variables in an external macro file, for example:

    %macro mylist;
       age height weight fastgluc postgluc
    %mend mylist;

You would then reference it as %mylist instead of &mylist. Either way, the above macro call will produce five different graphs, which will be put into your C:\GRAPHS folder. A snapshot of the folder using
thumbnails view in Windows Explorer appears in Figure 4. When analyzing hundreds of variables, you may find that having one .gif file for each variable makes browsing and categorizing the variables much easier than having a document of several hundred pages (for example, one produced by proc univariate).

Figure 4: Windows Explorer View of the .gif Files

In practice, you will probably want to have output more tailored to your needs than a histogram / scatter plot pair with a few summary statistics included. For example, when building predictive models where the response variable is binary, you might use the technique described in this paper to create a different set of graphs with different summary statistics. For instance, Figure 5 shows four different plots in one .gif file. Each plot is a logit plot with a frequency plot overlay. The upper left graph is for the original variable, and the other three plots are for common transformations that might be made when model building (square root, log, and reciprocal).

Figure 5: Logit Plots / Transformations Example

In conclusion, if you follow the four steps outlined in this paper, you should be able to produce customized individual .gif files containing graphs and summary statistics that will hopefully make whatever analysis you are doing more efficient.

EXTENSIONS

This paper has described how to put multiple graphs into a single .gif file. However, I have not addressed the question of how to make a single title or footnote for all the graphs in the .gif file. For example, in Figure 5 it would be nice to have a title across the top which says "logit plots for 4 transformations". This can be achieved via proc gslide and a custom template. This approach is described in reference [1].

Finally, it should be noted that another approach to putting multiple graphs onto the same page is to use a PDF device driver.
To do so, you would use the ods pdf option startpage=never as well as some graphics options such as vorigin=, horigin=, vsize=, and hsize= to position the graphs. This useful approach is described in reference [2], and is actually somewhat easier to program than using proc greplay as described in this paper.

REFERENCES

[1] Gayari, Michelle. Creating Graphs Using Templates. SUGI 22, paper 170.
[2] Delaney, Kevin P. Multiple Graphs on One Page, the easy way (PDF) and the hard way (RTF). SUGI 28, paper 94.
[3] SAS OnlineDoc, Version 8. SAS Macro Language Reference.
[4] SAS OnlineDoc, Version 8. SAS Procedures Guide.
[5] SAS OnlineDoc, Version 8. SAS/GRAPH Software: Reference.

CONTACT

The author can be reached at:

Paul Walker
15000 Capital One Drive, Building #2
Richmond, VA 23238
804-284-2311
walker.627@osu.edu

DISCLAIMER

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

APPENDIX: FULL MACRO CODE WITH COMMENTS

%macro graphlist (
   lib=     /* data library */,
   ds=      /* data set */,
   ivlist=  /* list of independent variables to be graphed */,
   dv=      /* dependent variable for scatterplot */,
   path=    /* output path for .gif files */
   ) ;

/* STEP 1: Parse the independent variable list (&ivlist). */

   *------------ Define variables used in parsing --------------------*;
   %let null = ;
   %let blank = %quote( );
   %let fflag = 0;
   %let num = 0;

   *------------ Loop through the list of variables ------------------*;
   %do %until(&&fflag=1);
      %let num=%eval(&num+1);
      %let var&num = %scan(&ivlist,&num,&blank);
      %if &null=&&var&num %then %let fflag=1;
   %end;

   *------------ Number of variables in your list --------------------*;
   %let nvars=%eval(&num-1);

/* STEP 2: Get summary statistics into macro variables. */

   *------------ Start looping through each variable ----------------*;
   %do i = 1 %to &nvars;

   *------------ Get summary stats from proc means ------------------*;
   proc means data = &lib..&ds noprint;
      var &&var&i &dv;
      output out=means
             mean(&&var&i) = mean
             median(&&var&i) = median
             n(&&var&i) = n
             nmiss(&&var&i) = nmiss;

   *------------ Put proc means output into macro vars -------------*;
   data _null_;
      set means;
      call symput('mean', trim(left(put(mean,best12.))));
      call symput('median', trim(left(put(median,best12.))));
      call symput('n', trim(left(put(n,best12.))));
      call symput('nmiss', trim(left(put(nmiss,best12.))));

   *------------ Get summary stats from proc corr ----------------*;
   proc corr data = &lib..&ds noprint outp=pearson;
      var &&var&i &dv;

   *------------ Put proc corr output into macro vars -----------*;
   data _null_;
      set pearson end=last;
      if last then call symput('corr', trim(left(put(&&var&i,5.3))));

/* STEP 3: Create graphs which include summary statistics. */

   *----------- Delete catalog entries --------------------------*;
   proc catalog catalog=work.gseg;
      save histo.grseg scatter.grseg;
      delete histo.grseg scatter.grseg;

   *----------- Set general graphics options --------------------*;
   goptions device=gif
            gsfname=paul
            xpixels=800
            ypixels=800
            gunit=pct
            ftext=courier
            nodisplay;
   ods listing;
   filename paul "&path.\graphlist_&&var&i...gif";

   *----------- Create histogram --------------------------------*;
   symbol; axis; title; note; footnote;
   axis1 label=(angle=90 height=4 "frequency")
         value=(height=3);
   axis2 label=(height=4 "&&var&i")
         value=(angle=90 height=3);

   proc gchart data = &lib..&ds gout=work.gseg;
      vbar &&var&i / name="histo"
                     levels=10
                     raxis=axis1
                     maxis=axis2 ;
      title height=5 "histogram of &&var&i";
      footnote justify=center height=3 " "
               justify=center height=4
               "mean=&mean, median=&median, nmiss=&nmiss, n=&n";

   *----------- Create scatterplot ----------------------------*;
   symbol; axis; title; note; footnote;
   symbol1 v=triangle height=4 width=4;
   axis3 label=(angle=90 height=4 "&dv")
         value=(height=3);
   axis4 label=(height=4 "&&var&i")
         value=(height=3);

   proc gplot data = &lib..&ds gout=work.gseg;
      plot &dv * &&var&i / name="scatter"
                           vaxis=axis3
                           haxis=axis4;
      title height=5 "scatterplot of &dv against &&var&i";
      footnote justify=center height=4 " "
               justify=center height=4
               "correlation between &dv and &&var&i is &corr.";

/* STEP 4: Put the graphs together using proc greplay. */

   *----------- Turn on the display --------------------------*;
   goptions display;

   *----------- Use proc greplay -----------------------------*;
   proc greplay nofs igout=work.gseg;
      tc sashelp.templt;
      template V2;
      treplay 1:histo 2:scatter;

   /*----------- End of the %do loop -------------------------*/
   %end;

%mend graphlist;