Automating the Production of Formatted Item Frequencies using Survey Metadata

Tim Tilert, Centers for Disease Control and Prevention (CDC) / National Center for Health Statistics (NCHS)
Jane Zhang, CDC / NCHS
Lewis Berman, CDC / NCHS

1. ABSTRACT

The National Health and Nutrition Examination Survey (NHANES) collects a vast array of questionnaire and examination data regarding the health and nutritional status of the United States population. Ongoing release of NHANES data to the public is one of the many tasks associated with the survey. Codebooks consisting of data item names and associated metadata, along with corresponding item frequencies, accompany the public data release. The challenge is to use existing metadata to automate the production of the detailed response or exam result frequencies for every data item released.

This poster illustrates a novel solution built on the SAS/IntrNet system and describes the challenges posed by combining metadata with actual survey data to produce automated frequency distributions. These challenges include associating item labels from the metadata with the actual survey data via dynamic SAS formats, systematically computing ranges for data that were not coded, handling floating point number limitations, ordering the final results in a standardized fashion, and updating the database with the resulting computed frequencies.

2. INTRODUCTION

The NHANES is designed to monitor the health and nutritional status of the U.S. population. In 1999, NHANES became a continuous survey fielded on an ongoing basis. The survey sample selected each year is a multistage probability sample of persons of all ages and is representative of the noninstitutionalized U.S. civilian population. Data are released in two-year cycles. Participation in the survey is voluntary. Findings are reported for the total U.S. population, as well as for selected race/ethnicity groups such as African Americans and Mexican Americans living in the U.S.

NHANES data are obtained by personal interviews, health examinations, and laboratory tests. All data collection methods follow standardized protocols. Initially, people who are selected for the survey samples are interviewed in their homes. Each interviewed individual is then invited to participate in a health examination component. The health examinations are conducted in Mobile Examination Centers (MEC). Examinees receive a preliminary report of their examination findings at the conclusion of the MEC exam and a final report of findings after all laboratory processing is completed.

3. PROBLEM

For each survey component (Blood Pressure Exam, Total Cholesterol Lab, Prescription Medication Questionnaire, for example), there are numerous exam, lab, or questionnaire items. Tied to the public release of the data, the National Center for Health Statistics (NCHS) releases frequencies for each of these items. Producing these frequencies is tedious for several reasons. First, some items have character values while others have numeric values, so one cannot simply run proc means or proc freq across all items to produce frequencies. Second, many of these survey items (both character and numeric) have several hundred or even thousands of distinct values, so a simple proc freq statement produces a table that is too difficult to read and is unmanageable from a publication standpoint.

In the past, a programmer was assigned to each component to address these issues. These programmers had to walk through each component, item by item, and determine whether proc freq or proc means should be run for each item. In addition, they had to write out SAS format statements for each item so that the resulting frequencies were formatted correctly.

The goal of this effort was to find a way to automate these frequencies, dynamically and automatically format all the values for each item, convert unmanageable lists of distinct values to value ranges, and order the resulting output in an easy-to-understand, consistent order.

4. APPROACH and METHODOLOGY

By using the pre-existing metadata that was created and validated in a web-based codebook application, it became possible to automate the production of the survey frequencies. A series of SAS macros was developed to combine the data to be released (residing in SAS datasets) with the pre-existing metadata (stored in Sybase).

Through the integration of the web-based codebook application with SAS/IntrNet, users are now able to call these SAS macros directly from the codebook application to dynamically and automatically format all the values for each survey item, convert unmanageable lists of distinct values to value ranges, order the resulting output in an easy-to-understand, consistent order, and save the final frequency output to the Sybase database.

4.1 DYNAMIC SAS FORMATS

In order to explain the development methodology, it is important to understand the metadata. The metadata for each survey component is stored in Sybase and is then presented as Hyper Text Markup Language (HTML) codebooks or data dictionaries. Below are two excerpts from the NHANES 2001-2002 Cardiovascular Fitness Examination codebook:

CVQ220m          Priority 2 Stop, other specified reasons
English Text:    Reason for Priority 2 Stop: Other specified reasons
Codes:           1 = Yes
                 2 = No
Skip To Values:

CVDEXLEN         Length of CV fitness exam (min)
English Text:    Length of the CV fitness exam (minutes)

All of the values presented in these codebook excerpts are stored in a metadata database, and it is these values that are used to dynamically create the formatted frequencies. The first requirement is to read all the item names (CVQ220m, CVDEXLEN) and the corresponding coded values (1 = Yes, 2 = No) for these items from the Sybase tables into separate SAS datasets.

Then, in order to dynamically create the SAS formats, each item requires its own unique format name. Since a survey component has a limited number of items, the approach is simply to use the observation number (_n_) to create the unique format names while still satisfying the SAS constraints that a format name be eight characters or less and not end with a digit. After creating the format names, the program loops through all the items. Based on the item type defined in the metadata database, the format name begins with fm if the item is numeric and with $fm if the item is character. See the code below:

[The SAS code listing at this point is illegible in the source.]
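The authors' SAS listing is not legible in this copy, so the naming scheme can only be sketched from the description above. The following Python illustration is not the authors' code: the trailing letter `f`, the function name `make_format_name`, the item types, and the item `EXAMPLE_TXT` are all assumptions made for the example. Only the `fm` / `$fm` prefixes, the use of the observation number, the eight-character limit, and the no-trailing-digit rule come from the paper.

```python
def make_format_name(obs_number: int, is_character: bool) -> str:
    """Build a unique SAS-style format name from an observation number.

    Assumed scheme: prefix + observation number + trailing letter "f".
    """
    prefix = "$fm" if is_character else "fm"
    # A trailing letter keeps the name from ending in a digit.
    name = f"{prefix}{obs_number}f"
    # Conservative check: the leading "$" is counted toward the limit here.
    if len(name) > 8:
        raise ValueError(f"format name {name!r} exceeds eight characters")
    return name


# (item name, is_character); the types shown are assumed for illustration.
items = [
    ("CVQ220m", False),
    ("CVDEXLEN", False),
    ("EXAMPLE_TXT", True),  # hypothetical character item
]

# Observation numbers start at 1, like _n_ in a SAS DATA step.
format_names = {
    item: make_format_name(n, is_char)
    for n, (item, is_char) in enumerate(items, start=1)
}

for item, fmt in format_names.items():
    print(item, fmt)
```

Because the observation number supplies uniqueness, the scheme works only while the number of items stays small enough for the name to fit in eight characters, which matches the paper's remark that a component has a limited number of items.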

Then, depending on whether the item is character or numeric, the appropriate macro is called to create the SAS formats. This is fairly straightforward. Two SAS datasets are created (one for numeric values and one for character values), each containing the starting value, the ending value, and the label to be used when formatting individual values. These datasets are then used in the SAS proc format statements later in the program.

4.2 CONVERT DISTINCT VALUES TO VALUE RANGES

Most of the SAS formats are straightforward, with one exception: converting overly large lists of values to a value range. For example, the length of a Cardiovascular Fitness exam (CVDEXLEN) has 791 distinct values, far too many to be practically displayed in a single frequency table.

The approach taken is to run proc freq for every item, regardless of whether it is character or numeric. A value in our metadata table designates the maximum number of discrete values that we will allow to display in a frequency table. The default is 50. This means that if more than 50 distinct uncoded values are found for an item, these distinct values are converted to one range of values.

This test and the subsequent conversion are accomplished by outputting the generated frequencies and counting the number of records in the resulting output file. If the number of records in the output file exceeds the maximum number of values allowed, the outputted values are converted to a range for numeric values or simply labeled using the desired metadata label for character values. If the number of records does not exceed the maximum, the outputted values are displayed as they are.

Since proc freq orders its output by value by default and the frequencies have just been saved to a file, it is very straightforward at this point to create the range of values. The first record in the output frequency file becomes the "from" value in the range, while the last record becomes the "to" value.

4.3 HANDLING FLOATING POINT NUMBER LIMITATIONS

Once the range issue had been solved, the application worked well, but periodically the output for a given item contained one of the range delimiters as its own value record, in addition to a formatted range of values. This duplication happens only with floating point numeric values, that is, numbers with decimal places. After looking through the temporary datasets, it was discovered that the numbers do not match exactly; they differ in the outermost decimal places. This mismatch is due to the limitations of floating point numeric representation, which exist in nearly every software package and hardware device.

With some research [1], it was determined that a fuzz value can be used in the format datasets to tell SAS to ignore differences smaller than a certain precision value. Since the differences are all past six decimal places and that level of precision is not required, the fuzz value in the numeric formats is set to .00001. This resolves all of the data misrepresentations.
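The range conversion of Section 4.2 and the fuzz comparison of Section 4.3 can be sketched together. This is a Python illustration of the logic only, not the authors' SAS macros; the function names `frequency_rows` and `in_range` and the sample data are hypothetical. The threshold of 50 and the fuzz value of .00001 are the values quoted in the paper, and `in_range` is a simplified model of how a fuzz factor lets a value match a range endpoint it misses by a tiny amount.

```python
from collections import Counter

MAX_DISCRETE_VALUES = 50  # default maximum quoted in Section 4.2
FUZZ = 0.00001            # fuzz value quoted in Section 4.3


def frequency_rows(values, max_values=MAX_DISCRETE_VALUES):
    """Return (label, count) rows: one per distinct value, or one range row."""
    counts = Counter(values)
    ordered = sorted(counts)  # mimic value-ordered frequency output
    if len(ordered) > max_values:
        # First record becomes the "from" value, last record the "to" value.
        low, high = ordered[0], ordered[-1]
        return [(f"{low} to {high}", sum(counts.values()))]
    return [(str(value), counts[value]) for value in ordered]


def in_range(value, low, high, fuzz=FUZZ):
    """True if value lies in [low, high], allowing a tolerance at each end."""
    return low - fuzz <= value <= high + fuzz


# 100 distinct made-up durations, which exceeds the maximum of 50.
exam_lengths = [round(i * 0.05, 2) for i in range(100)]
print(frequency_rows(exam_lengths))

# A few distinct values stay as individual rows.
print(frequency_rows([1, 2, 2, 3]))

# Floating point residue: the sum is slightly larger than the endpoint 0.3.
endpoint_value = 0.1 + 0.2
print(endpoint_value <= 0.3)               # exact comparison misses the endpoint
print(in_range(endpoint_value, 0.0, 0.3))  # the tolerance absorbs the difference
```

Without the tolerance, a value such as `0.1 + 0.2` falls just outside a range ending at 0.3 and would be reported as its own record, which is the kind of duplicate row described in Section 4.3.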

4.4 ORDERING THE FINAL RESULTS

Sorting the output values is not a trivial task. The values for an item can be either character or numeric. There are significant differences between sorting numeric values and sorting character values, and an algorithm was needed that would work in all cases. Since the maximum length of a coded value had been set a priori at 40 characters in our database, we chose to create a special character variable (dom_val_sort) in the database, also with a length of 40, that could be used for sorting the values.

If the coded value was numeric, the value of dom_val_sort was front-filled with blanks. This way, instead of 40 preceding 4 as it would when sorting on the coded value itself, the value 4 always precedes the value 40 when sorted using dom_val_sort. Conversely, if the coded value was character, the value of dom_val_sort was back-filled with blanks. Finally, to ensure that MISSING values are always displayed last in the outputted frequencies, their dom_val_sort value was set to a string of 40 Z characters.

4.5 UPDATING THE DATABASE

In order to produce the HTML output using the previously developed web application, the database needs to be updated to include the frequencies as well as the newly created sort order variable (dom_val_sort). This was accomplished using a simple proc append statement. In the very first attempt at updating the database, the program elicited the following error: Unable to update a Sybase table with an Identity field with SAS V8.2. After more research [2], it was discovered that this was a known error in SAS V8.2 that required the download and installation of SAS technical support hotfix 82SB09. After applying the hotfix, the program was able to update the database successfully.

4.6 RESULTS

Below are the same codebook excerpts shown earlier from the NHANES 2001-2002 Cardiovascular Fitness Examination codebook. These excerpts are from the new codebooks. Note that these examples now include the automatically computed, formatted frequencies:

CVQ220m          Priority 2 Stop, other specified reasons
English Text:    Reason for Priority 2 Stop: Other specified reasons

Code or Value    Description    Count    Skip to Item
1                Yes            42
2                No             411
.                Missing        4699
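The row order in the CVQ220m table (coded values first, Missing last) follows from the dom_val_sort key described in Section 4.4. A minimal Python sketch of that padding logic is shown here; it is an illustration under assumptions, not the authors' SAS code. The function signature and the sample rows are hypothetical, while the variable name dom_val_sort, the width of 40, the blank padding, and the Z-filled key for missing values come from the paper.

```python
SORT_WIDTH = 40  # maximum coded-value length stated in Section 4.4


def dom_val_sort(coded_value: str, is_numeric: bool, is_missing: bool = False) -> str:
    """Build a fixed-width character key that sorts codes consistently."""
    if is_missing:
        # A Z-filled key places missing rows after the padded keys.
        return "Z" * SORT_WIDTH
    if is_numeric:
        # Front-fill with blanks so "4" sorts before "40".
        return coded_value.rjust(SORT_WIDTH)
    # Back-fill character values with blanks.
    return coded_value.ljust(SORT_WIDTH)


# (coded value, is_numeric, is_missing), listed deliberately out of order.
rows = [("40", True, False), (".", True, True), ("4", True, False)]

# Sorting on the raw strings puts "." first and "40" ahead of "4".
print(sorted(row[0] for row in rows))

# Sorting on the padded key gives 4, 40, then missing.
ordered = sorted(rows, key=lambda row: dom_val_sort(*row))
print([row[0] for row in ordered])
```

One limit of this sketch is worth noting: under ASCII ordering, a character value that begins with a lowercase letter compares greater than the Z-filled key, so it would sort after the missing row. The paper does not say how, or whether, such values occur in the metadata.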

CVDEXLEN         Length of CV fitness exam (min)
English Text:    Length of the CV fitness exam (minutes)

Code or Value    Description        Count    Skip to Item
0 to 36.73       Range of Values    5152
.                Missing            0

5. CONCLUSIONS

By combining existing metadata with survey release data, it is possible to take a long, tedious, very user-involved process and turn it into an easy-to-use, automated SAS/IntrNet program. In prior releases, codebooks were tediously created via manual data entry into Microsoft FrontPage. The frequency files were also created manually from user-defined macros for each survey item. Moving forward, it is now possible to combine the codebook information with formatted frequencies into a single output file and to produce this file automatically, without any manual user intervention. This significantly speeds up and simplifies the release process and offers end users an easier-to-use, fully integrated data dictionary complete with frequencies.

6. REFERENCES

1. Pete Lund, More than Just Value: A Look into the Depths of PROC FORMAT, SAS Users Group International 27th Annual Conference Proceedings - http://www2.sas.com/proceedings/sugi27/p004-27.pdf
2. SAS Technical Support Web Site, SN-010867, Unable to update a Sybase table with an Identity field with SAS V8.2 - http://support.sas.com/techsup/unotes/sn/010/010867.html

7. ACKNOWLEDGEMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

8. CONTACT INFORMATION

Tim Tilert
Centers for Disease Control and Prevention / National Center for Health Statistics
3311 Toledo Rd.
Hyattsville, MD 20782
Work phone: (301) 458-4284
Fax: (301) 458-4029
E-mail: tnt6@cdc.gov

Date Last Modified: September 7, 2004
Submitted to: The Northeast SAS Users Group