%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

Rushi Patel, Creative Information Technology, Inc., Arlington, VA

ABSTRACT

It is common to find missing values in datasets. Analysts deal with them in a variety of ways, ranging from simply ignoring the affected records to robust imputation. The choice of strategy depends on how many values are missing and on the relevance of the corresponding variable. This paper presents a macro that outputs the missing value percentages for all variables in all files (or for selected variables in selected files) in a multi-year information system. The focus is on using the SAS Macro Language to write pieces of code dynamically, the MISSING and NMISS functions, iterative macro execution, and ODS to export the final report. The intended audience is beginner to intermediate programmers interested in automating tasks through utility macros.

INTRODUCTION

Missing values are a reality of datasets. An analyst has several options for tackling them, and the option chosen is largely a function of two factors: the variable in question, and the extent to which that variable is coded as missing. This paper describes a macro that generates a report listing the percentage of missing values for each variable in the Highway Safety Information System (HSIS), a multi-state, multi-file, multi-year information system. HSIS covers nine states, and each state has 4 to 10 files. Each file is available for some or all years from 1985 onward. The paper first gives an overview of the layout of the files, followed by a brief description of the basic procedures and functions used to compute the missing percentages. The focus then shifts to a routine that generates the required pieces of code through a series of macros invoked iteratively within a blanket macro (%MISSING).
LAYOUT OF DATA FILES IN THE INFORMATION SYSTEM

Broadly speaking, each file in HSIS is either a crash (accident) based file or a segment based file. Each observation in a segment based file has a specific length, while crash based files are point files. For segment based files, the desired output is the percentage of missing values as a share of the total length within the file; for crash based files, it is a share of the total number of observations. A sample data file of each type is listed below.

    /* crash based file */
    data <state>_<file>_<year>;
       input a $ b;
       cards;
    a 1
    b .
    . 2
    c 3
    ;
    run;

    /* segment based file */
    data <state>_<file>_<year>;
       input a $ c length;
       cards;
    a 4 10
    . 3 10
    c . 20
    ;
    run;

The output for the crash based file shows 25% missing (1 in 4 observations) for each of the variables a and b. For the segment based file, the output for variable a is 25% missing (10 units of length out of a total of 40) and for variable c is 50% missing (20 units of length out of a total of 40).

BASIC PROCEDURES AND FUNCTIONS

The NMISS() and COUNT() functions do the trick for crash based files. Used as a summary function in PROC SQL, NMISS() returns the number of missing values of the variable passed to it, and COUNT(*) returns the number of observations in the dataset. The following template computes the share of missing values for the specified variables in a dataset. As written it returns a proportion; the %CRASH_BASED macro listed later multiplies the ratio by 100 to report a percentage.

    proc sql;
       create table <table> as
       select nmiss(<variable1>) / count(*) as miss_<variable1>_pct,
              nmiss(<variable2>) / count(*) as miss_<variable2>_pct
              ...
       from <state>_<file>_<year>;
    quit;
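As a quick check of the figures quoted for the two sample files, the following sketch rebuilds them and computes both kinds of percentage. The dataset names crash_demo and seg_demo, and the name seg_lng for the length column, are illustrative assumptions rather than HSIS names. The second query applies the length-weighted definition given above for segment based files, using MISSING() and SUM().

```sas
/* Sketch only: crash_demo, seg_demo and seg_lng are illustrative names. */
data crash_demo;
   input a $ b;
   cards;
a 1
b .
. 2
c 3
;
run;

data seg_demo;
   input a $ c seg_lng;
   cards;
a 4 10
. 3 10
c . 20
;
run;

proc sql;
   /* crash based: share of observations with a missing value */
   select nmiss(a) / count(*) * 100 as miss_a_pct,
          nmiss(b) / count(*) * 100 as miss_b_pct
   from crash_demo;

   /* segment based: share of total length with a missing value */
   select sum(missing(a) * seg_lng) / sum(seg_lng) * 100 as miss_a_pct,
          sum(missing(c) * seg_lng) / sum(seg_lng) * 100 as miss_c_pct
   from seg_demo;
quit;
```

Under these assumptions, the first query should report 25 for both a and b, and the second should report 25 for a and 50 for c, in line with the percentages stated above.
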

The MISSING() and SUM() functions are used to compute the percentage of missing values for segment based files. MISSING() returns 1 for a data point that is missing and 0 for one that is present. SUM() returns the sum of an expression across the entire dataset. The following template computes the required share for segment based files.

    proc sql;
       create table <table> as
       select sum(missing(<variable1>) * seg_lng) / sum(seg_lng) as miss_<variable1>_pct,
              sum(missing(<variable2>) * seg_lng) / sum(seg_lng) as miss_<variable2>_pct
              ...
       from <state>_<file>_<year>;
    quit;

DESIRED OUTPUT

Table 1 shows a sample of the desired output. %MISSING generates this output for every year in which a file is available, for all state-file combinations. The value in a year-variable cell is the missing value percentage, computed according to whether the file is crash based or segment based. A "." indicates that a variable is not present for that year.

Table 1: Report for Minnesota Accident data

    Variables     Year 2000   Year 2001   Year 2002   Year 2003
    Variable 1    10.4        13.5        0           12.1
    Variable 2    .           .           100         100
    Variable 3    12          12          12          12
    Variable 4    0           0           .           .

MACRO STRUCTURE

%MISSING consists of a series of macros that are invoked iteratively within it. The steps involved are:

1. For each file available for a particular state, determine whether that file is segment based or crash based. This is done within %MISSING, and a global macro variable rdcheck is assigned the value Y (indicating a segment based file) or N (indicating a crash based file).
2. Based on whether rdcheck is Y or N, generate the relevant select statements within a PROC SQL routine. This is done by two support macros, %CRASH_BASED and %SEGMENT_BASED, which are invoked conditionally within another macro, %COMPUTE_PCT.
3. Invoke %COMPUTE_PCT for all state-file-year combinations. This is done by populating a series of macro variable arrays and stepping through them iteratively with %do loops within %MISSING.
4. Collect, organize and restructure the output generated by each execution of %COMPUTE_PCT. This is also done within %MISSING, using PROC TRANSPOSE, PROC APPEND and DATA steps.
5. Generate the desired reports through ODS, again within %MISSING.

A prerequisite for executing %MISSING is a dataset, names, that has three variables: st (state), file and yr (year). This dataset holds all the possible state-file-year combinations, and the first PROC SQL routine in %MISSING refers to it. By screening the dataset names, a user can also execute %MISSING on a limited part of HSIS. The libname data points to an ORACLE database, on which HSIS resides.

    proc contents data = data._all_ noprint out = temp (keep = memname);
    run;

    data names (keep = st yr file);
       set temp;
       by memname;
       if first.memname;              /* one row per dataset */
       st   = substr(memname, 1, 2);  /* characters 1-2: state */
       yr   = substr(memname, 3, 2);  /* characters 3-4: year */
       file = substr(memname, 5);     /* characters 5 onward: file name */
    run;
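
The screening of names and the list-stepping in step 3 can be tried out without access to the database. The sketch below is a minimal illustration under stated assumptions: names_demo, its three rows, the state and file codes, and the macro %walk_states are all hypothetical and are not part of %MISSING. It shows the pattern of loading distinct values into a space-separated macro variable and stepping through it with %scan inside a %do %while loop.

```sas
/* Sketch only: names_demo, its values and %walk_states are hypothetical. */
data names_demo;
   input st $ yr $ file $;
   cards;
MN 00 ACC
MN 01 ACC
NC 00 ROAD
;
run;

/* screen the list so that only one state is processed */
data names_demo;
   set names_demo;
   where st = 'MN';
run;

%macro walk_states;
   %local i state root;
   /* load the distinct states into a space-separated list */
   proc sql noprint;
      select distinct st into :state separated by ' '
      from names_demo;
   quit;
   /* step through the list one item at a time */
   %let i = 1;
   %let root = %scan(&state, &i);
   %do %while (%length(&root) > 0);
      %put NOTE: would process state &root;
      %let i = %eval(&i + 1);
      %let root = %scan(&state, &i);
   %end;
%mend walk_states;

%walk_states
```

The loop ends when %scan returns an empty string, which is the same stopping rule the listings in this paper rely on. Nesting one such loop per level (state, file, year) gives the iteration described in step 3.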

SUPPORT MACROS - %CRASH_BASED and %SEGMENT_BASED

    options mprint;

    %macro crash_based (vars = );
       %local i v list;
       %let i = 1;
       %let v = %scan(&vars, &i);
       %do %while (%length(&v) ^= 0);
          /* append one select-list item per variable */
          %let list = &list, (nmiss(&v)/count(*)) * 100 as miss_&v._pct;
          %let i = %eval(&i + 1);
          %let v = %scan(&vars, &i);
       %end;
       /* return the list without its leading comma */
       %substr(%superq(list), 2)
    %mend crash_based;

A macro variable array consisting of all variable names present in a particular state-year-file combination is passed to either of these macros through the vars= parameter. This macro variable array is populated within %COMPUTE_PCT by a PROC SQL routine before the support macros are executed. The macro quoting function %superq() is used to mask the special characters generated within the code. The details of this and related quoting functions are beyond the scope of this paper, and the reader is referred to prior work by Whitlock (1). %SEGMENT_BASED is similar to %CRASH_BASED and is not listed here.

%COMPUTE_PCT

    %macro compute_pct (root = , file = , year = );
       proc sql noprint;
          /* collect the variable names of the dataset into &vars */
          select lowcase(name) into :vars separated by " "
          from dictionary.columns
          where libname = "DATA" and memname = "%upcase(&root&year&file)";

          /* dataset name = state + two-digit year + file */
          create table temp as
          %if &rdcheck = Y %then %do;
             select %segment_based( vars = &vars )
             from data.&root&year&file;
          %end;
          %if &rdcheck = N %then %do;
             select %crash_based( vars = &vars )
             from data.&root&year&file;
          %end;
       quit;

       proc transpose data = temp out = temp;
       run;

       data temp;
          format miss_pct 4.2;
          length varname $17;
          set temp ( rename = ( _name_ = varname col1 = miss_pct ) );
          label varname = "Variable";
          varname = substr( varname, 6 );                         /* drop the miss_ prefix */
          varname = substr( varname, 1, length( varname ) - 4 );  /* drop the _pct suffix */
          year = &year;
       run;
    %mend compute_pct;

The parameters passed to this macro are root= (state), file= (name of the file) and year= (each year for a state-file combination), the values of which are generated in %MISSING. The macro populates the macro variable array vars, checks the value of rdcheck, and executes one of the two support macros accordingly, passing vars to it. PROC TRANSPOSE and the DATA step that follows organize the output.

%MISSING

This macro manages the execution of %COMPUTE_PCT and is referred to in this paper as the blanket macro. It creates the macro variables that are passed to %COMPUTE_PCT through a series of %do loops. It takes all the output from %COMPUTE_PCT and appends it to a file, creating a separate report for each state-file combination. Finally, it uses the Output Delivery System (ODS) to write the report for each state-file combination as an MS Word file in the designated directory.

    %macro MISSING;
       %global rdcheck;

       /* list of states */
       proc sql noprint;
          select distinct(st) into :state separated by ' '
          from names;
       quit;

       %let i = 1;
       %let root = %scan(&state, &i);
       %do %while (%length(&root) > 0);

          /* list of files for this state */
          proc sql noprint;
             select distinct(file) into :FILES&root separated by ' '
             from names
             where st = "&root";
          quit;

          %let j = 1;
          %let branch = %scan(&&files&root, &j);
          %do %while (%length(&branch) > 0);

             /* list of years for this state-file combination */
             proc sql noprint;
                select yr into :&root.&branch separated by ' '
                from names
                where st = "&root" and file = "&branch";
             quit;

             %let k = 1;
             %let getyr = %scan(&&&root.&branch, &k);
             %do %while (%length(&getyr) > 0);

                /* segment based (Y) if seg_lng is among the variables, else crash based (N) */
                proc sql noprint;
                   select quote(trim(lowcase(name))) into :dum separated by ","
                   from dictionary.columns
                   where libname = "DATA" and memname = "%upcase(&root&getyr&branch)";

                   select distinct case when "seg_lng" in (&dum) then 'Y' else 'N' end as var1
                   into :rdcheck
                   from dictionary.columns
                   where libname = "DATA" and memname = "%upcase(&root&getyr&branch)";
                quit;

                %compute_pct( root = &root, file = &branch, year = &getyr )

                proc append base = dat1 data = temp;
                run;

                %let k = %eval(&k + 1);
                %let getyr = %scan(&&&root.&branch, &k);
             %end;

             /* one row per variable, one column per year */
             proc sort data = dat1;
                by varname year;
             run;

             proc transpose data = dat1 out = dat1 ( drop = _name_ ) prefix = year;
                by varname;
                id year;
             run;

             ods rtf file = "d:\missing Drive\&root\&branch..rtf";
             proc print data = dat1;
                title "Percent Missing for &root. &branch. files";
             run;
             ods rtf close;

             proc datasets lib = work;
                delete dat1 temp;
             run;
             quit;

             %let j = %eval(&j + 1);
             %let branch = %scan(&&files&root, &j);
          %end;

          %let i = %eval(&i + 1);
          %let root = %scan(&state, &i);
       %end;
    %mend MISSING;

CONCLUSION

The main feature of this macro is that it generates the required report automatically, with minimal user intervention. With simple modifications, it can also be executed on a limited part of the information system. Along with determining all the variables in any dataset in the information system, it determines the type of the dataset. The output allows the user to identify variables with a high percentage of missing observations, as well as variables with an abrupt change in that percentage from year to year. Finally, it writes the report as an MS Word file in a specified directory for future reference and sharing.

ACKNOWLEDGEMENTS

I thank Creative Information Technology, Inc. (CITI) for sponsoring this paper. CITI is currently providing SAS services through engagements with U.S. Federal Government clients at the Departments of Housing and Urban Development and Transportation. I am grateful to Mr. Ian Whitlock for his immensely useful suggestions in response to my inquiry on SAS-L. The basic macro design discussed here was suggested by Mr. Whitlock. Many thanks to Mr. Yusuf Mohamedshah and Dr. Forrest Council at the Highway Safety Information System (HSIS) for their continued encouragement to develop macros that streamline HSIS data analysis tasks. Last but not least, I thank my wife and my mother for being a constant source of support and motivation.

REFERENCES

1. Whitlock, Ian. A Serious Look at Macro Quoting. Paper 11, Proceedings of the Twenty-Eighth Annual SAS Users Group International Conference. Available online at www2.sas.com/proceedings/sugi28/011-28.pdf
2. Highway Safety Information System website. www.hsisinfo.org

AUTHOR CONTACT

Your comments and questions are valued and encouraged. Contact the author at:

Rushi B. Patel
Creative Information Technology, Inc.
1010 N. Glebe Road, Suite 710, Arlington, VA 22201
E-mail 1: rushi.b.patel@gmail.com
E-mail 2: rpatel@citi-us.com
Phone: 240-383-6207

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.