%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System Rushi Patel, Creative Information Technology, Inc., Arlington, VA ABSTRACT It is common to find missing values in datasets. An analyst deals with missing values in a variety of ways that range from simply ignoring these records to robust imputations. The decision to choose a strategy to tackle missing values depends upon their magnitude and relevance of the corresponding variable. This paper presents a macro that outputs the missing value percentages for all variables in all files (or for selected variables in selected files) in a multi-year information system. The focus is on using SAS Macro Language to dynamically write pieces of code, MISSING and NMISS functions, iterative macro execution and ODS to export the final report. The intended audience is beginner to intermediate programmers interested in automating tasks through utility macros. INTRODUCTION Missing values are a reality of datasets. An analyst has several options to tackle missing values. However, the option chosen is largely a function of two factors. First, is the variable in question and second is the extent to which that variable is coded as missing. This paper describes a macro coded to generate a report that enlists the percentage of missing values for each variable present in Highway Safety Information System (HSIS), a multi-state, multi-file, multi-year information system. HSIS is comprised of nine geographic states and each of these states has 4 to 10 files. Again, each file is present for any or all years from 1985 onward. Initially the paper gives an overview of the layout of files, followed by a brief description of the basic procedures and functions used to compute the missing percentages. The focus then shifts to describing a routine that generates the required pieces of code through a series of macros that are invoked iteratively within a blanket macro (%MISSING). LAYOUT OF DATA FILES IN THE INFORMATION SYSTEM Broadly speaking, each file in HSIS can either be a crash (accident) based file or a segment based file. A segment based file has a specific length for each observation, while crash based files are point files. For segment based files, the desired output is a percentage of missing values as a function of total length within the file, while for crash based files the desired output is a function of total number of observations. Both sample data files are listed below. /* crash based file */ Data <state>_<file>_<year> Input $a b Cards a 1 b. 2 C 3 Run /* segment based file */ Data <state>_<file>_<year> Input $a c length Cards a 4 10 3 10 c. 20 Run The output for a crash based file will show 25% missing (1 in 4 observations) for variables a and b. While for segment based file, the output for variable a will be 25% missing (10 units of length out of a total of 40) and for variable c will be 50% missing (20 units of length out of a total of 40). BASIC PROCEDURES AND FUNCTIONS NMISS() and COUNT() functions do the trick for crash based files. NMISS() returns the number of missing values for a variable or a list of variables passed through it. COUNT() returns the number of observations in any dataset. The following template of code computes the missing value percentages for the specified variables in a dataset. proc sql create table <table> as select nmiss(<variable1>) / count(*) as miss_<variable1>_pct, nmiss(<variable2>)/ count(*) as miss_<variable2>_pct. from <state>_<file>_<year> quit 1
MISSING() and SUM() functions are used to compute the percentage of missing values for segment based files. MISSING() returns 0 or 1 for a data point that is present or missing. SUM() returns the summation of a variable across the entire dataset. The following template computes the required percentages for segment-based files. proc sql create table <table> as select sum(missing(<variable1>) *seg_lng) / sum(seg_lng) as miss_<variable1>_pct, sum(missing(<variable2>) *seg_lng) / sum(seg_lng) as miss_<variable2>_pct. from <state>_<file>_<year> DESIRED OUTPUT Table 1 shows a sample of the desired output. %MISSING generates this output for all years that a file is available for all statefile combinations. The value in a year-variable cell represents the missing value percentage depending upon whether a file is crash based or segment based. A. indicates a variable is not present for that year. Table 1: Report for Minnesota Accident data Variables Year 2000 Year 2001 Year 2002 Year 2003 Variable 1 10.4 13.5 0 12.1 Variable 2.. 100 100 Variable 3 12 12 12 12 Variable 4 0 0.. MACRO STRUCTURE %MISSING consists of a series of macros that are iteratively invoked within it. The steps involved are 1. For each file available for a particular state, determine whether that file is segment based or crash based. This is done within %MISSING and a global macro variable rdcheck is assigned a value Y (indicating a segment based file) or N (indicating a crash based file). 2. Based on whether rdcheck is Y or N, generate the relevant select statements within a PROC SQL routine. This is done by two support macros, %CRASH_BASED or %SEGMENT_BASED, which are invoked conditionally within another macro %COMPUTE_PCT. 3. Invoke %COMPUTE_PCT for all state-file-year combinations. This is done by populating a series of macro variable arrays and stepping into those iteratively through %do loops within %MISSING. 4. Collect, organize and restructure the output generated from each execution of %COMPUTE_PCT. This is also done within %MISSING using PROC TRANSPOSE, APPEND and DATA steps. 5. Generate the desired reports through ODS done within %MISSING. A pre-requisite for executing %MISSING is a dataset names that has three variables state, file and year. This dataset has all the possible state-file-year combinations. The first PROC SQL routine in %MISSING refers to this dataset. Also, by screening dataset names, a user can execute %MISSING on a limited part of HSIS. Libname data points to an ORACLE database, on which HSIS resides. proc contents data = data._all_ noprint out=temp (keep=memname) data names (keep = st yr file) set temp by memname if first.memname len=length(memname) st=substr(memname,1,2) yr=substr(memname,3,2) start=5 run=len-start+1 file = substr(memname,start,run) run 2
SUPPORT MACROS - %CRASH_BASED and %SEGMENT_BASED options mprint %macro crash_based (vars = ) %local i v list %let i = 1 %let v = %scan(&vars,&i) %do %while (%length(&v) ^= 0) %let list = &list, (nmiss(&v)/count(*)) * 100 as miss_&v._pct %let i = %eval( &i + 1 ) %let v = %scan(&vars,&i) %end %substr(%superq(list),2) %mend crash_based A macro variable array consisting of all variable names present in a particular state-year-file combination is passed to either of these macros (parameter vars=). This macro variable array is populated within %COMPUTE_PCT through a PROC SQL routine before executing the support macros. The macro quoting function %superq() is used to mask the special characters generated within the code. The details of this and related quoting functions are beyond the scope of this paper and the reader is referred to some prior work by Whitlock (1). %SEGMENT_BASED is similar to %CRASH_BASED and is not listed here. %COMPUTE_PCT %macro compute_pct ( root =, file =, year = ) proc sql noprint select lowcase(name) into :vars separated by " " from dictionary.columns where libname = "DATA" and memname = "%upcase(&root&year&branch)" create table temp as %if &rdcheck = Y %then %do select %segment_based( vars = &vars ) from data.&root&year&branch %if &rdcheck = N %then %do select %crash_based( vars = &vars ) from data.&root&year&branch quit proc transpose data = temp out = temp data temp format miss_pct 4.2 length varname $17 set temp ( rename = ( _name_ = varname col1 = miss_pct ) ) label varname = "Variable" 3
varname = substr( varname, 6 ) varname = substr ( varname, 1, length ( varname ) - 4 ) year = &year run %mend compute_pct Parameters passed to this macro are root= (state), file= (name of the file) and year= (each year for a state-file combination), the values of which are generated in %MISSING. It checks the rdcheck variable value and executes either of the two support macros based on its value. It populates a macro variable array vars which is passed to the support macros. PROC TRANSPOSE and the following DATA step is for organizing the output. %MISSING This macro manages the execution of the %COMPUTE_PCT and is referred in this paper as the blanket macro. This macro creates the macro variables that are passed to %COMPUTE_PCT, through a series of %do loops. It takes all the output from %COMPUTE_PCT and appends it in a file, creating a separate report for each state-file combination. Finally, it outputs the report as a MS Word file for each state-file combination to the designated directories using Output Delivery System (ODS). %macro MISSING %global rdcheck select distinct(st) into: state separated by from names quit %let i=1 %let root=%scan(&state,&i) %do %while(%length(&root) > 0) select distinct(file) into: FILES&root separated by ' ' from names where st="&root" run %let j=1 %let branch=%scan(&&files&root,&j) %do %while(%length(&branch) > 0) select yr into: &root.&branch separated by ' ' from names where st="&root" and file = "&branch" run %let k=1 %let getyr = %scan(&&&root.&branch,&k) %do %while (%length(&getyr) > 0) 4
select quote(trim(lowcase(name))) into:dum separated by "," from dictionary.columns where libname = "DATA" and memname = "%upcase(&root&year&branch)" select distinct case when "seg_lng" in (&dum) then 'Y' else 'N' end as var1 into : rdcheck from dictionary.columns where libname = "DATA" and memname = "%upcase(&root&year&branch)" %compute_pct( root=&root, file=&branch, year=&getyr ) proc append base = dat1 data = temp run %let k = %eval(&k + 1) %let getyr = %scan(&&&root.&branch, &k) proc sort data = dat1 by varname year proc transpose data = dat1 out = dat1 ( drop = _name_ ) prefix = year by varname id year ods rtf file = "d:\missing Drive\&root\&branch..rtf" proc print title "Percent Missing for &root. &branch. files" ods rtf close proc datasets lib= work delete dat1 temp run %let j=%eval(&j+1) %let branch=%scan(&&files&root,&j) %let i=%eval(&i+1) %let root=%scan(&state,&i) %mend MISSING CONCLUSION The main feature of this macro is that it can generate the required report automatically, with minimal user intervention. Also, with simple modifications, it can be easily executed on a limited part of the information system. Along with determining all variables in any dataset in the information system, it also determines the type of the dataset. The output allows the user to 5
identify variables with high percentage missing observations and also variables with an abrupt change in percentage of missing observations. Finally, it outputs the report as a MS Word file in a specified directory for future reference and sharing. ACKNOWLEDGEMENTS I thank Creative Information Technology, Inc. (CITI) for sponsoring this paper. CITI is currently providing SAS services through engagements with U.S. Federal Government clients at the Departments of Housing and Urban Development and Transportation. I am grateful to Mr. Ian Whitlock for his immensely useful suggestions in response to my inquiry on SAS-L. The basic macro design discussed here was suggested by Mr. Whitlock. Many thanks, to Mr. Yusuf Mohamedshah and Dr. Forrest Council at Highway Safety Information System (HSIS) for their continued encouragement to develop macros to streamline HSIS data analysis tasks. And last but not the least I thank my wife and my mother for being a constant source of support and motivation. REFERENCES 1. Whitlock, Ian. A serious look at Macro Quoting. Paper 11, Proceedings of twenty eighth annual SAS-User Group International Conference. Available online at www2.sas.com/proceedings/sugi28/011-28.pdf 2. Highway Safety Information System website. www.hsisinfo.org. AUTHOR CONTACT Your comments and questions are valued and encouraged. Contact the author at: Rushi B. Patel Creative Information Technology, Inc., 1010 N. Glebe Road, Suite 710, Arlington, VA 22201 E-mail 1: rushi.b.patel@gmail.com E-mail 2: rpatel@citi-us.com Phone: 240-383-6207 SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. 6