MACROS TO REPORT MISSING DATA: AN HTML DATA COLLECTION GUIDE

Patrick Thornton, University of California San Francisco

ABSTRACT

This paper presents SAS macros that produce missing data reports in HTML. The reports are useful for informing managers and guiding the activities of data collection staff. The macros use the CONTENTS and TRANSPOSE procedures, the DATA step, and user-supplied information to profile missing data points overall, by variable, and by observation. The reports also incorporate summaries by groups of variables, and a customized menu is integrated with the menu created by the SAS Output Delivery System (ODS).

INTRODUCTION

The SAS macros in this paper produce missing data statistics and HTML reports to assist research staff in tracking survey data collection. The reports present various views of missing data in order to: (a) inform managers of the need to focus data collection resources on specific variables or groups of variables, and (b) create an organized and detailed guide for data collection staff.

The macros allow the variables in a SAS data set to be grouped on two dimensions on which to summarize missing data rates. In this example, the dimensions are survey type (e.g., Juvenile Justice items) and time (e.g., survey items collected at 3 months).

The first report is the summary by dimension just described. The second report lists the number and percent of respondents missing for each variable in the data set. The third report shows the number and percent of missing variables for each observation (respondent) in the data set.

As an example, the table for the first report counts the variables and expected responses overall (e.g., All) and within the values of the two dimensions (e.g., Juvenile Justice, Intake). Expected responses are calculated by multiplying the number of variables by the number of observations. The percent of missing responses is the number of missing responses divided by the expected responses, multiplied by 100.
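The arithmetic behind these statistics can be sketched in a short DATA step. The counts of 510 variables and 120 observations are those of the study data set described in this paper; the count of 15,300 missing responses is a made-up value used only for illustration, and the data set name rates is hypothetical. The variable names follow those used in the macros listed later.

```sas
/* Worked example of the report 1 statistics.                 */
/* msrs is a hypothetical count chosen only for illustration. */
data rates;
   items    = 510;               /* number of variables (items)      */
   exrdents = 120;               /* number of observations           */
   msrs     = 15300;             /* missing responses (hypothetical) */
   exrs     = items * exrdents;  /* expected responses               */
   prsm     = msrs / exrs * 100; /* percent of responses missing     */
   put exrs= prsm=;              /* log shows exrs=61200 prsm=25     */
run;
```

With these inputs, 61,200 responses are expected and 25 percent of them are missing.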
The percent-missing statistic informs managers of the missing rate for each group of variables (items).

The macros were originally conceived to report missing data for a multi-site study of social services given to high-risk youth. Surveys were collected at different time points relative to a social service program: intake, 3 months, 6 months, 9 months, and 12 months. Each survey reported items of one of the following types: (a) alcohol and drug use, (b) education, (c) juvenile justice, and (d) mental health. All the data were stored in one large data set with some 510 variables and 120 observations, where each observation contained all survey responses for a participant.

Missing Data Reports

The macros for this paper generate five reports that use the two dimensions to summarize or list missing variables and/or observations. Reports 1, 2, and 3 were designed to allow project managers to review data collection progress by meaningful variable groups. Reports 4 and 5 were specifically designed to give data collection staff feedback on which items need to be collected for each participant; these reports may be very long. The fourth report lists the observations (respondents) according to percent of items missing (e.g., 80-100% Items Missing), and the fifth report lists, for each participant with missing data, all variables in need of data collection.

OVERVIEW OF THE MISSING DATA MACROS

This section walks through the logic of the macros that generate the missing data statistics, using simplified syntax. The next section covers HTML production. For additional detail, please refer to the complete macros toward the back of the paper.

Grouping Variables on a Variable List

The primary function of the macro %GVARCAT was to produce a data set of variable names that could be grouped on two dimensions.
The first step, simplified from the code, is as follows:

    proc contents data=original out=cats(keep=name);
    run;

The cats data set contains the variable name, which stores the names of the variables in the original data set. The first DATA step in %GVARCAT uses the values of name to assign each observation in cats to a group on the new variables itemd and itemd2. The simplified code showing the creation of itemd is as follows:

    data cats;
       set cats;
       itemd = .;
       _name_ = put(name,$8.);
       type = substr(name,1,2);
       if type = "jj" then itemd = 1;
       else if type = "ed" then itemd = 2;
       else if type = "ad" then itemd = 3;
       else if type = "mh" then itemd = 4;
       format itemd dim1.;
    run;

The syntax for assigning variables to groups may be altered to produce useful variable groupings for other data sets, with the formats altered accordingly. Creating meaningful categories for variables is much easier if a planned variable naming convention was used. If it is too late to create a naming convention, the categories may be assigned by hand. For example, cats could be saved to an Excel spreadsheet in which columns are created for itemd and itemd2. Each cell of these columns could be assigned a number indicating the group to which the variable belongs on each of the two dimensions. The spreadsheet could then be imported to SAS and used as cats in the example above.

Combining Grouped Variables with the Original Data

Now that the data set cats contains the variables in the original data set grouped on two meaningful dimensions, cats is combined with the original data so that missing data information may be generated by dimension. The TRANSPOSE procedure was used to produce a data set in which an observation exists for every combination of variable and observation in the original data set. The extracted code is as follows:

    proc transpose data=original out=odbyvar;
       by part;
    run;

The new data set odbyvar contains the new variables _name_ and col1. These variables store the name and the value, respectively, of each variable from original. The data set odbyvar was used to count the observations having and not having missing values on each variable.
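The long layout that the TRANSPOSE step produces can be sketched on a toy data set. The data values and the item names jjf3_1, edf3_1, and mhf3_1 below are hypothetical and are not taken from the study; only the names original, part, odbyvar, _name_, and col1 come from the paper.

```sas
/* Hypothetical toy data: two participants, three numeric items */
data original;
   input part jjf3_1 edf3_1 mhf3_1;
   datalines;
1 2 . 4
2 . . 1
;
run;

/* BY-group processing requires the data to be sorted by part */
proc sort data=original;
   by part;
run;

/* One output observation per participant-by-item combination */
proc transpose data=original out=odbyvar;
   by part;
run;

proc print data=odbyvar noobs;
run;
```

With two participants and three items, odbyvar holds six observations, and col1 is missing wherever the original response was missing.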
For example, the counts can be produced with two FREQ procedures:

    proc freq data=odbyvar noprint;
       tables _name_ / out=tnmis;
       where col1 ne .;
    run;

    proc freq data=odbyvar noprint;
       tables _name_ / out=tmis;
       where col1 = .;
    run;

The FREQ procedures count the observations from the original data set with non-missing (tnmis) and missing (tmis) values on each variable. Since the data sets tnmis and tmis both have the variable _name_, they can be merged with cats1. (The variable name in cats1 must first be renamed to _name_.) For example, the following simplified syntax demonstrates the merge:

    data mis;
       merge cats1(in=c)
             tmis(in=ta rename=(count=nobmis))
             tnmis(rename=(count=nobnmis));
       by _name_;   * all three data sets must be sorted by _name_;
       if c and ta=c;
       drop percent;
    run;

The new data set mis contains:

    ITEMD     Dimension 1
    ITEMD2    Dimension 2
    _NAME_    Name of original variable
    NOBMIS    Count of missing observations from original
    NOBNMIS   Count of non-missing observations from original

Additional calculations were made from both the data sets cats and mis using the SUMMARY procedure:

    proc summary data=cats;
       class itemd itemd2;   * needed for counts by dimension;
       output out=tv;
    run;

    proc summary data=mis;
       class itemd itemd2;
       var nobmis;
       output out=tv2 sum=msrs mean=mnobmis;
    run;

The first procedure creates the data set tv, which contains a count of all variables, a count of variables by each dimension, and a count of variables by both dimensions (see Table 1). The second procedure produces the data set tv2, which contains the total and average number of observations missing within all combinations of the dimension variables.

OVERVIEW OF MISSING DATA REPORTS

This section discusses the production of HTML reports to view the missing data information discussed above and generated by the %GVARCAT macro. Some independence between generating the missing data information and producing the HTML reports was desirable. This section again presents an overview with simplified syntax from the macros.

Working with the Output Delivery System

SAS ODS was used to output most of the HTML reports; however, PUT statements in a DATA step were used to generate a couple of reports and a custom menu.
The custom HTML and style statements were designed to take advantage of existing capabilities created by ODS. The following demonstrates the syntax used to generate HTML with ODS. The macro variable dfp defines the root directory for the output. Note that the ODS statements were specifically designed to allow later addition of information to both the body and menu files:

    %let title = Example Items;
    %let dfp = c:\example\;
    filename new  "&dfp.origmiss.html";
    filename menu "&dfp.menu.html";
    filename fram "&dfp.frame.html";
    ods html body=new (no_bottom_matter)
             contents=menu (no_bottom_matter)
             frame=fram
             style=fancyprinter;
    %rep1;
The no_bottom_matter suboption, which follows the declaration of both the body and menu files, was critical to allow later appending of custom information to the HTML. As a result, the body and menu files were created as normal, but they did not contain concluding HTML when they were closed. The lack of closing HTML allowed additional HTML or script to be appended to the files. For example, the following syntax allowed the DATA step PUT statements produced by the macros %REP4 and %REP5 to append information to the body file:

    filename new "&dfp.root\origmiss.html" mod;
    %rep4;
    %rep5;

The most crucial option in this statement was mod, which opens the existing file so that output is appended rather than creating a new file. The following ODS statement was then used to close the file, excluding the top matter:

    ods html file=new(no_top_matter) anchor='end';

Adding a Custom Menu to the ODS Menu File

Missing data report number 5 lists all missing variables for each participant (observation) in the data set. Since the list may be quite long, it was useful to have a menu that allowed users to jump to the listings for specific groups of participants (e.g., 1-12, 13-24, 25-36). For example, clicking on the link for participants 13-24 jumps to participant 13 in report 5 in the body file. The complete macros, %menu, %menustyl, %spanhead, and %spanclos, are listed later in this paper, but as an overview, creation of the custom menu has two steps:

    1. Adding a new menu heading to the ODS menu
    2. Inserting links for each group of participants

The original ODS statement specified a menu file, so menu items were constructed automatically by ODS for missing data reports 1-3. All three reports were generated from PRINT procedures executed within ODS. In contrast, reports 4 and 5 were written directly into the body file through PUT statements outside ODS.
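The append technique can be sketched without ODS. The fileref demo, the macro variable demopath, and the file name demo.html are hypothetical, and the file is written to the WORK library directory only to keep the example self-contained. The first DATA step creates the file and deliberately leaves the HTML unclosed; the fileref is then reassigned with MOD so that the second DATA step appends rather than overwrites.

```sas
/* Hypothetical file location: the WORK library directory */
%let demopath = %sysfunc(pathname(work))/demo.html;

/* Step 1: create the file, leaving the HTML unclosed */
filename demo "&demopath";
data _null_;
   file demo;
   put "<html><body>";
   put "<h1>Reports 1-3 would appear here</h1>";
run;

/* Step 2: reassign the fileref with MOD so output is appended */
filename demo "&demopath" mod;
data _null_;
   file demo;
   put "<h2>Appended by a later DATA step</h2>";
   put "</body></html>";
run;
```

Without MOD on the second FILENAME statement, the second DATA step would replace the file and the heading written in step 1 would be lost.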
No menu items were created for reports 4 and 5; however, like the body file, the menu file was created without closing HTML. ODS uses the <SPAN>, <DT>, and <DL> tags to define each menu item, to enclose the links under the item, and to let a click on the menu heading toggle the visibility of the enclosed links. A new menu heading following the ODS convention was added by the macros %menustyl, %spanhead, and %spanclos:

    <script language=javascript1.2>
    function lodit(v){
    parent.body.location.href='origmiss.html#' + v
    }
    </script>
    <style type='text/css'>
    #menu1 { display : block; TEXT-ALIGN: center; BACKGROUND-COLOR: thistle}
    </style>
    <li><span>participants Missing Data</SPAN><br><dl><dt>
    <SPAN id='menu1'><b> </b>
    <a href=javascript:lodit('1') >1-5 </a><br>
    <a href=javascript:lodit('7') >7-13 </a><br>
    <a href=javascript:lodit('15') >15-20 </a><br>
    <a href=javascript:lodit('21') >21-25 </a><br>
    </SPAN><br></dl>

The macro %menu was used to generate the participant groups that are linked to positions in the body file. The position of each participant listed in report 5 was marked with an <A name=anchor> tag in the body file, where anchor was the participant number.

    %menu(odbyvar2,part,mi=20,t=participants);

The argument MI to the %menu macro dictates the number of participant groups to add to the menu. For example, with MI=20, the macro will try to produce 20 groups of equal size from the observations. If there were 130 observations, then 20 groups of 6 participants (1-5, 7-13, etc.) would be created, with the remaining 10 participants listed in the last group. Note also that the numbers may not be continuous because some participants may not have missing data.

MACRO EXECUTION

The syntax in this section assumes that all the macros in the sections to follow have been defined. In this example, the macro %GVARCAT is passed allc, the name of a data set, and part, the variable uniquely identifying each observation.
Creating Missing Data Information

    %gvarcat(allc,part,scat=);

Creating HTML Reports

    %let title = Example Items;
    %let dfp = c:\example\;
    filename new  "&dfp.root\origmiss.html";
    filename menu "&dfp.root\menu.html";
    filename fram "&dfp.root\frame.html";
    ods html body=new (no_bottom_matter)
             contents=menu (no_bottom_matter)
             frame=fram
             style=fancyprinter;
    %rep1;
    %rep2;
    %rep3;

    filename new "&dfp.root\origmiss.html" mod;
    %rep4;
    %rep5;
    ods html file=new(no_top_matter) anchor='end';

    filename menu "&dfp.root\menu.html" mod;
    data _null_;
       file menu;
       put "<script language=javascript1.2>";
       put "function lodit(v){";
       put "parent.body.location.href='origmiss.html#' + v";
       put "}";
       put "</script>";
       %menustyl;
       %spanhead;
    run;
    %menu(odbyvar2,part,mi=20,t=participants);
    data _null_;
       file menu;
       %spanclos;
    run;
    ods html file=menu(no_top_matter) anchor='end';
    * close the destination so the closing HTML is written;
    ods html close;

Missing Data Macros

    %macro gvarcat(od,oid,scat=,scat1=);
    proc contents data=&od noprint
                  out=cats(keep=varnum name label nobs);
    run;
    data cats1 (keep=_name_ label itemd2 varnum itemd);
       set cats end=last;
       if name ne "&oid" then do;
          length _name_ $8;
          _name_ = put(name,$8.);
          type = substr(name,1,2);
          *first domain of variables;
          if type = "jj" then itemd = 1;
          else if type = "ed" then itemd = 2;
          else if type = "ad" then itemd = 3;
          else if type = "mh" then itemd = 4;
          * second domain of variables;
          per = substr(name,3,1);
          i = index(name,"_");
          if i > 0 then do;
             t = substr(name,1,i-1);
             l = length(t)-3;
             tp = substr(name,4,l);
             if per = "b" then tp = 0;
             if tp = 0 then itemd2 = 1;
             else if tp = 3 then itemd2 = 2;
             else if tp = 6 then itemd2 = 3;
             else if tp = 9 then itemd2 = 4;
             else if tp = 12 then itemd2 = 5;
          end;
          *formats domain indicators;
          format itemd2 dims.;
          format itemd dimf.;
          if itemd2 ne . and itemd ne . then output cats1;
       end;
    run;
    data cats;
       set cats1;
       %if &scat ne %then %do;
          if itemd = &scat then output;
       %end;
    run;
    %odbyvar(&od,&oid);
    %nobsmis(&od,odbyvar,&oid);
    %pmiscat(&od,odbyvar2,&oid);
    %mend gvarcat;

    %macro odbyvar(od,oid);
    proc summary data=cats;
       class itemd itemd2;
       output out=tv;
    run;
    proc sort data=cats;
       by _name_;
    run;
    proc sort data=&od;
       by &oid;
    run;
    proc transpose data=&od out=odbyvar;
       by &oid;
    run;
    proc sort data=odbyvar out=l nodupkey;
       by _name_;
    run;
    %mend odbyvar;

    %macro nobsmis(od,d,oid);
    proc freq data=&d noprint;
       tables _name_ / out=tnmis;
       where col1 ne .;
    run;
    proc freq data=&d noprint;
       tables _name_ / out=tmis;
       where col1 = .;
    run;
    proc sql noprint;
       select count(&oid) into :numobs from &od;
    quit;
    data mis;
       merge cats(in=c)
             tmis(in=ta rename=(count=nobmis))
             tnmis(in=t rename=(count=nobnmis))
             l;
       by _name_;
       if c and ta=c;
       label pnobmis = "% Obs Missing";
       label itemd = "Dimension 1";
       label itemd2 = "Dimension 2";
       label nobnmis = "Obs Complete";
       label allobs = "All Obs";
       label nobmis = "Obs Missing";
       label label = "Description";
       label exrdents = "Expected Respondents";
       label _name_ = "Variable";
       allobs = sum(nobnmis,nobmis);
       pnobmis = 0;
       exrdents = &numobs;
       pnobmis = (nobmis/exrdents)*100;
       drop part varnum percent;
    run;
    *get total and average missing responses;
    proc summary data=mis;
       class itemd itemd2;
       var nobmis;
       output out=tv2 sum=msrs mean=mnobmis;
    run;
    *total variables in each dimension;
    proc sort data=tv;
       by itemd itemd2;
    run;
    *total and average responses missing;
    proc sort data=tv2;
       by itemd itemd2;
    run;
    *tv3 is data set for report 1;
    data tv3;
       merge tv (rename=(_freq_=items)) tv2(drop=_freq_);
       by itemd itemd2;
       label itemd = "Dimension 1";
       label itemd2 = "Dimension 2";
       label exrs = "Expected Responses";
       label prsm = "% Responses Missing";
       label msrs = "Missing Responses";
       label mnobmis = "Average Responses Missing";
       label items = "Variables";
       label exrdents = "Expected Respondents";
       exrdents = &numobs;
       exrs = items*exrdents;
       prsm = msrs/exrs*100;
       drop _type_;
    run;
    *each participant has observations 1 to number of variables;
    proc sort data=&d;
       by _name_;
    run;
    proc sort data=mis;
       by _name_;
    run;
    *merge variable categories with obs/variable dataset;
    * mis has "Obs Complete", "All Obs","Obs Missing", "Expected Respondents";
    data odbyvar2;
       merge &d(in=d) mis(in=m);
       by _name_;
       if m=d;
       if col1=.;
    run;
    %mend nobsmis;

Macros to Create the Missing Data Reports

    %macro rep1;
    proc print data=tv3 label noobs;
       var itemd itemd2 items exrdents exrs msrs prsm mnobmis;
       * var itemd itemd2 items exrdents exrs msrs prsm;
       title "&title";
       format prsm pctfmt.;
       title2 "Report #1: Summary of Missing Data";
    run;
    %mend rep1;

    %macro rep2;
    proc print data=mis label noobs;
       var itemd itemd2 label _name_ nobmis pnobmis;
       title "&title";
       title2 "Report #2: Respondents (Observations) Missing by Item";
    run;
    %mend rep2;

    %macro rep3;
    proc sort data=tipp2;
       by descending tmisp;
    run;
    proc print data=tipp2 label noobs;
       var part itemd itemd2 vmis tipp tmisp;
       title "&title";
       title2 "Report #3: Missing Items by Observation";
    run;
    %mend rep3;

    %macro rep4;
    proc sort data=tipp2;
       by itemd itemd2 descending misp part;
    run;
    * page heading for the data collection guide;
    data _null_;
       file new;
       t = put(date(),mmddyy8.);
       put "<center>";
       put "<h1>&title</h1>";
       put "<h2>Data Collection Guide</h2>";
       put "<h3>" t "</h3></center>";
    run;
    data _null_;
       set tipp2 nobs=tot;
       file new;
       by itemd itemd2 descending misp part;
       if _n_ = 1 then do;
          put "<h2>Report 4: Data Collection Needed for the Following Participants</h2>";
       end;
       if first.itemd then do;
          put "<h3>" itemd "</h3>";
       end;
       if first.itemd2 then do;
          put "<table border=1 width=700> <tr><td colspan=2><b>" itemd2 "</b></td></tr>";
          put "<tr><td>% Items Missing<td>Participants</tr>";
       end;
       if first.misp then do;
          put "<tr><td>" tmisp "<td>";
       end;
       put "<a href=#" part ">" part "</a>";
       if last.itemd2 then put "</tr></table><br>";
    run;
    %mend rep4;

    %macro rep5;
    *filename new "&dfp.root\origmiss.html" mod;
    *items needed within participant;
    %let block = "block";
    proc sort data=odbyvar2;
       by part itemd2 itemd _name_;
    run;
    data _null_;
       set odbyvar2 nobs=tot;
       by part itemd2 itemd _name_;
       retain col 0;
       file new;
       if first.part then do;
          put "<h1>Participant # <a name=" part ">" part "</a> is Missing the Following Items: </h1>";
       end;
       if first.itemd2 then do;
          put "<table border=1 width=700>";
       end;
       if first.itemd then do;
          put "<tr><td colspan=4><h3>" itemd2 itemd "</h3><br></tr><tr>";
          col = 0;
       end;
       col = col + 1;
       put "<td> " _label_ "<br>";
       if mod(col,4) = 0 then put "<tr>";
       if last.itemd2 then do;
          put "</table>";
       end;
    run;
    %mend rep5;

Macros to Add a Custom Menu to the ODS Menu

    %macro menustyl;
    put "<style type='text/css'>";
    put " #menu1 { display : block; TEXT-ALIGN: center; BACKGROUND-COLOR: thistle}";
    put "</style>";
    %mend menustyl;

    %macro spanhead;
    put "<li><span>participants Missing Data</SPAN><br><dl><dt>";
    put "<SPAN id='menu1'><b> </b>";
    %mend spanhead;

    %macro spanclos;
    put "</SPAN><br></dl>";
    %mend spanclos;

    %macro menu(d,uid,mi=10,t=);
    proc sort data=&d nodupkey out=parts;
       by part;
    run;
    data _null_;
       file menu;
       set parts nobs=t;
       retain p w r s sc lr ur 0;
       if _n_ = 1 then do;
          p = &mi;          *number of equal segments;
          w = floor(t/p);   *width of each segment;
          r = t - p*w;      *remaining obs past last equal segment;
          lr = &uid;        * begin with first obs;
          s = 1;            *segment 1;
       end;
       if sc = 0 then lr = part;
       if s <= p then do;   *doing one of the equal segments;
          sc = sc + 1;      *within segment count;
          if sc = w then do; *at the upper value of the segment;
             ur = &uid;
             pt = "'" || trim(left(lr)) || "')";
             put "<a href=javascript:lodit(" pt ">" lr "-" ur "</a><br>";
             s = s + 1;
             sc = 0;
          end;
       end;
       else do;             *do remaining obs after last segment;
          sc = sc + 1;
          if r > 0 and _n_ = t then do;
             ur = &uid;
             pt = "'" || trim(left(lr)) || "')";
             put "<a href=javascript:lodit(" pt ">" lr "-" ur "</a>";
          end;
       end;
    run;
    %mend menu;

CONCLUSION

The missing data reports presented here have been useful in providing swift feedback on the progress of data collection activities. The reports and their organization in HTML are continuously refined based on the needs of staff. Producing the reports as scheduled output has nevertheless reduced ad hoc requests for specific reports on missing data.

REFERENCES

SAS Institute Inc. The Complete Guide to the SAS Output Delivery System, Version 8. Cary, NC: SAS Institute Inc., 1999.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Patrick Thornton, Ph.D.
Child Services Research Group
UCSF
1388 Sutter Street, Suite 503
San Francisco, CA 94109
Work Phone: 415 502-8004
Fax: 415 502-6177
E-mail Address: pthornt@itsa.ucsf.edu
www.wolfstrategies.com