Automating the Production of Formatted Item Frequencies using Survey Metadata

Tim Tilert, Centers for Disease Control and Prevention (CDC) / National Center for Health Statistics (NCHS)
Jane Zhang, CDC / NCHS
Lewis Berman, CDC / NCHS

1. ABSTRACT

The National Health and Nutrition Examination Survey (NHANES) collects a vast array of questionnaire and examination data on the health and nutritional status of the United States population. Ongoing release of NHANES data to the public is one of the many tasks associated with the survey. Codebooks consisting of data item names and associated metadata, along with corresponding item frequencies, accompany the public data release. The challenge is to use the existing metadata to automate the production of detailed response or exam result frequencies for every data item released.

This poster illustrates a novel solution built on the SAS/IntrNet system, along with the unique challenges posed by combining metadata with actual survey data to produce automated frequency distributions. These challenges include associating item labels from the metadata with the actual survey data via dynamic SAS formats, systematically computing ranges for data that were not coded, handling floating point number limitations, ordering the final results in a standardized fashion, and updating the database with the resulting computed frequencies.

2. INTRODUCTION

The NHANES is designed to monitor the health and nutritional status of the U.S. population. In 1999, NHANES became a continuous survey fielded on an ongoing basis. The survey sample selected each year is a multistage probability sample of persons of all ages and is representative of the noninstitutionalized U.S. civilian population. Data are released in two-year cycles. Participation in the survey is voluntary. Findings are reported for the total U.S. population, as well as for selected race/ethnicity groups such as African Americans and Mexican Americans living in the U.S.

NHANES data are obtained by personal interviews, health examinations, and laboratory tests. All data collection methods follow standardized protocols. Initially, people who are selected for the survey samples are interviewed in their homes. Each interviewed individual is then invited to participate in a health examination component. The health examinations are conducted in Mobile Examination Centers (MEC). Examinees receive a preliminary report of their examination findings at the conclusion of the MEC exam and a final report of findings after all laboratory processing is completed.
3. PROBLEM

For each survey component (for example, the Blood Pressure Exam, the Total Cholesterol Lab, or the Prescription Medication Questionnaire), there are numerous exam, lab, or questionnaire items. Tied to the public release of the data, the National Center for Health Statistics (NCHS) releases frequencies for each of these items.

Producing these frequencies is tedious for several reasons. First, some items have character values while others have numeric values, so one cannot simply run proc means or proc freq across all items to produce frequencies. Second, many of these survey items (both character and numeric) have several hundred or even thousands of distinct values. A simple proc freq on such an item produces a table that is too difficult to read and is unmanageable from a publication standpoint.

In the past, a programmer was assigned to each component to address these issues. These programmers had to walk through each component, item by item, and determine whether proc freq or proc means should be run for each item. They also had to write out SAS format statements for each item so that the resulting frequencies were formatted correctly.

The goal of this effort was to find a way to automate these frequencies: dynamically and automatically format all the values for each item, convert unmanageable lists of distinct values to value ranges, and present the resulting output in a consistent, easy-to-understand order.

4. APPROACH and METHODOLOGY

By using the pre-existing metadata that was created and validated in a web-based codebook application, it became possible to automate the production of the survey frequencies. A series of SAS macros was developed to combine the data to be released (residing in SAS datasets) with the pre-existing metadata (stored in Sybase).
Through the integration of the web-based codebook application with SAS/IntrNet, users can now call these SAS macros directly from the codebook application to dynamically and automatically format all the values for each survey item, convert unmanageable lists of distinct values to value ranges, present the resulting output in a consistent, easy-to-understand order, and save the final frequency output to the Sybase database.

4.1 DYNAMIC SAS FORMATS

To explain the development methodology, it is important to understand the metadata. The metadata for each survey component are stored in Sybase and are then presented as Hyper Text Markup Language (HTML) codebooks or data dictionaries. Below are two excerpts from the NHANES 2001-2002 Cardiovascular Fitness Examination codebook:
CVQ220m    Priority 2 Stop, other specified reasons
    English Text: Reason for Priority 2 Stop: Other specified reasons
    Codes: 1 = Yes
           2 = No
    Skip To Values:

CVDEXLEN   Length of CV fitness exam (min)
    English Text: Length of the CV fitness exam (minutes)

All of the values presented in these codebook excerpts are stored in a metadata database, and it is these values that are used to dynamically create the formatted frequencies.

To create the formatted frequencies, the first requirement is to read all the item names (CVQ220m, CVDEXLEN) and the corresponding coded values (1=Yes, 2=No) for these items from the Sybase tables into separate SAS datasets. Then, to create the SAS formats dynamically, each item requires its own unique format name. Since the number of items in a survey component is limited, the approach is simply to use the observation number (_n_) to build the unique format names, while still satisfying the SAS constraints that a format name be eight characters or less and not end with a digit.

After creating the format names, the program loops through all the items. The item type recorded in the metadata database determines the prefix: if the item is numeric, the format name begins with fm, and if the item is character, the format name begins with $fm. See the code below:

[code listing not recoverable from the source text]
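The naming rule described above can be illustrated with a minimal sketch. This is not the authors' original listing; the dataset names (items, item_formats), the item_type variable, the third item (CVXNOTE), the type assigned to each item, and the trailing letter f are all assumptions made for illustration.

```sas
/* Hypothetical item list, standing in for rows read from the metadata    */
/* tables. CVXNOTE and the N/C type codes are assumptions, not real data. */
data items;
   length item_name $32 item_type $1;
   input item_name $ item_type $;
   datalines;
CVQ220m N
CVDEXLEN N
CVXNOTE C
;
run;

data item_formats;
   set items;
   length fmtname $8;
   /* The observation number (_n_) makes each name unique. The trailing  */
   /* letter keeps the name from ending in a digit. compress() removes   */
   /* the blanks that put() leaves in front of the number.               */
   if item_type = 'C' then fmtname = compress('$fm' || put(_n_, 5.) || 'f');
   else fmtname = compress('fm' || put(_n_, 5.) || 'f');
run;

proc print data=item_formats noobs;
run;
```

With an eight-character limit, a two-character prefix, and a one-letter suffix, this scheme leaves room for a five-digit observation number for numeric items and a four-digit one for character items (where the $ also counts toward the length), which fits the paper's remark that the number of items per component is limited.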
Then, depending on whether the item is character or numeric, the appropriate macro is called to create the SAS formats. This is fairly straightforward. Two SAS datasets are created (one for numeric values and one for character values), each containing the starting value, the ending value, and the label to be used when formatting individual values. These datasets are then used in the SAS proc format statements later in the program.

4.2 CONVERT DISTINCT VALUES TO VALUE RANGES

Most of the SAS formats are straightforward, with one exception: converting overly large lists of values to a value range. For example, the length of a Cardiovascular Fitness exam (CVDEXLEN) has 791 distinct values, far too many to be practically displayed in a single frequency table.

The approach taken is to run proc freq for every item, regardless of whether it is character or numeric. A value in our metadata table designates the maximum number of discrete values that may be displayed in a frequency table. The default is 50. This means that if more than 50 distinct uncoded values are found for an item, these distinct values are converted to one range of values.

This test and the subsequent conversion are accomplished by writing the generated frequencies to an output file and counting the number of records in that file. If the number of records exceeds the maximum number of values allowed, the values are converted to a range (for numeric items) or simply labeled using the desired metadata label (for character items). Otherwise, the values are displayed as they are. Because proc freq orders its output by value by default and the frequencies have just been saved to a file, it is very straightforward at this point to create the range of values.
The first record in the output frequency file becomes the "from" value of the range, and the last record becomes the "to" value.

4.3 HANDLING FLOATING POINT NUMBER LIMITATIONS

Once the range issue had been solved, the application worked well, but periodically the output for a given item contained one of the range delimiters as its own value record, in addition to the formatted range of values. This duplication happens only with floating point numeric values, that is, numbers with decimal places. An inspection of the temporary datasets showed that the numbers don't match exactly; they differ in the outermost decimal places. This mismatch is due to the limitations of floating point numeric representation, which exist in nearly every software package and hardware device.

With some research [1], it was determined that a fuzz value can be supplied in the format datasets to tell SAS to ignore differences smaller than a certain precision value. Since the differences are all past six decimal places and that level of precision is not required, the fuzz value in the numeric formats is set to .00001. This resolves all of the data misrepresentations.
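The range conversion of Section 4.2 and the fuzz setting of Section 4.3 can be sketched together as follows. This is an illustrative sketch, not the authors' macro code: the data values, the dataset names (exam, freq_out, range_fmt), the format name fmrng, the macro variable max_values, and the lowered threshold of 3 are all assumptions, and catx() assumes SAS 9 or later. The sketch covers only the case where the number of distinct values exceeds the maximum.

```sas
/* Hypothetical exam data with no missing values; the variable name      */
/* follows the codebook excerpt, the values are made up.                 */
data exam;
   input cvdexlen @@;
   datalines;
0 12.5 18.2 18.2 25.01 36.73
;
run;

%let max_values = 3;   /* stands in for the metadata default of 50 */

/* Unformatted frequencies, written to a dataset ordered by value. */
proc freq data=exam noprint;
   tables cvdexlen / out=freq_out;
run;

/* Build a one-row CNTLIN dataset: the first record supplies the "from"  */
/* value and the last record the "to" value. The variables are renamed   */
/* to START and END, the names that proc format expects.                 */
data range_fmt(rename=(from_val=start to_val=end));
   set freq_out end=last nobs=n_values;
   retain from_val;
   length fmtname $8 type $1 label $40;
   if _n_ = 1 then from_val = cvdexlen;
   if last and n_values > &max_values then do;
      fmtname = 'fmrng';
      type    = 'N';
      to_val  = cvdexlen;
      fuzz    = 0.00001;   /* tolerate tiny floating point differences */
      label   = catx(' ', from_val, 'to', to_val);
      output;
   end;
   keep fmtname type from_val to_val label fuzz;
run;

proc format cntlin=range_fmt;
run;

/* With the format applied, the distinct values collapse into one range. */
proc freq data=exam;
   tables cvdexlen;
   format cvdexlen fmrng.;
run;
```

The FUZZ variable in the CNTLIN dataset plays the same role as the FUZZ= option on a VALUE statement: a value that falls just outside the stored endpoint because of representation error is still assigned to the range instead of appearing as its own row.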
4.4 ORDERING THE FINAL RESULTS

Sorting the output values is not a trivial task. The values for an item can be either character or numeric. There are significant differences between sorting numeric values and sorting character values, and an algorithm was needed that would work in all cases.

Since the maximum length of a coded value was set a priori at 40 characters in our database, we chose to create a special character variable (dom_val_sort), also with a length of 40, that could be used for sorting the values. If the coded value was numeric, the value of dom_val_sort was front-filled with blanks (right-justified). This way, instead of 40 preceding 4, as happens when sorting on the coded value itself as text, 4 always precedes 40 when sorting on dom_val_sort. Conversely, if the coded value was character, the value of dom_val_sort was back-filled with blanks (left-justified). Finally, the dom_val_sort value for MISSING was set to a string of 40 Z characters so that missing records are always displayed last in the output frequencies.

4.5 UPDATING THE DATABASE

To produce the HTML output using the previously developed web application, the database needs to be updated to include the frequencies as well as the newly created sort order variable (dom_val_sort). This was accomplished using a simple proc append statement. The very first attempt at updating the database raised the following error: "Unable to update a Sybase table with an Identity field with SAS V8.2". After more research [2], it was discovered that this was a known error in SAS V8.2 that required downloading and installing SAS technical support hotfix 82SB09. After the hotfix was applied, the program was able to update the database successfully.

4.6 RESULTS

Below are the same codebook excerpts shown earlier from the NHANES 2001-2002 Cardiovascular Fitness Examination codebook.
These excerpts are from the new codebooks. Note that these examples now include the automatically computed, formatted frequencies:

CVQ220m    Priority 2 Stop, other specified reasons
    English Text: Reason for Priority 2 Stop: Other specified reasons

    Code or Value   Description   Count   Skip to Item
    1               Yes              42
    2               No              411
    .               Missing        4699
CVDEXLEN   Length of CV fitness exam (min)
    English Text: Length of the CV fitness exam (minutes)

    Code or Value   Description       Count   Skip to Item
    0 to 36.73      Range of Values    5152
    .               Missing               0

5. CONCLUSIONS

By combining existing metadata with survey release data, it is possible to take a long, tedious, very user-involved process and turn it into an easy-to-use, automated SAS/IntrNet program. In prior releases, codebooks were tediously created via manual data entry into Microsoft FrontPage. The frequency files were also created manually from user-defined macros for each survey item. Moving forward, it is now possible to combine the codebook information with formatted frequencies into a single output file and to produce this file automatically, without any manual user intervention. This significantly speeds up and simplifies the release process and offers end users an easier-to-use, fully integrated data dictionary complete with frequencies.

6. REFERENCES

1. Pete Lund, "More than Just Value: A Look into the Depths of PROC FORMAT", SAS Users Group International 27th Annual Conference Proceedings - http://www2.sas.com/proceedings/sugi27/p004-27.pdf
2. SAS Technical Support Web Site, SN-010867, "Unable to update a Sybase table with an Identity field with SAS V8.2" - http://support.sas.com/techsup/unotes/sn/010/010867.html

7. ACKNOWLEDGEMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.
8. CONTACT INFORMATION

Tim Tilert
Centers for Disease Control and Prevention / National Center for Health Statistics
3311 Toledo Rd.
Hyattsville, MD 20782
Work phone: (301) 458-4284
Fax: (301) 458-4029
E-mail: tnt6@cdc.gov

Date Last Modified: September 7, 2004
Submitted to: The Northeast SAS Users Group