Data Quality Review for Missing Values and Outliers

Similar documents
Developing Data-Driven SAS Programs Using Proc Contents

Validation Summary using SYSINFO

An Efficient Tool for Clinical Data Check

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Biostatistics & SAS programming. Kevin Zhang

CC13 An Automatic Process to Compare Files. Simon Lin, Merck & Co., Inc., Rahway, NJ Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ

Automating Preliminary Data Cleaning in SAS

Using SAS Macro to Include Statistics Output in Clinical Trial Summary Table

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

Missing Pages Report. David Gray, PPD, Austin, TX Zhuo Chen, PPD, Austin, TX

Facilitate Statistical Analysis with Automatic Collapsing of Small Size Strata

Automating Comparison of Multiple Datasets Sandeep Kottam, Remx IT, King of Prussia, PA

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

A SAS Macro for Producing Benchmarks for Interpreting School Effect Sizes

Paper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency.

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

footnote1 height=8pt j=l "(Rev. &sysdate)" j=c "{\b\ Page}{\field{\*\fldinst {\b\i PAGE}}}";

Utilizing SAS for Cross- Report Verification in a Clinical Trials Setting

ABSTRACT INTRODUCTION TRICK 1: CHOOSE THE BEST METHOD TO CREATE MACRO VARIABLES

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

Automated Macros to Extract Data from the National (Nationwide) Inpatient Sample (NIS)

Hidden in plain sight: my top ten underpublicized enhancements in SAS Versions 9.2 and 9.3

Uncommon Techniques for Common Variables

Paper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA

Quick and Efficient Way to Check the Transferred Data Divyaja Padamati, Eliassen Group Inc., North Carolina.

PharmaSUG Paper SP07

%Addval: A SAS Macro Which Completes the Cartesian Product of Dataset Observations for All Values of a Selected Set of Variables

Prove QC Quality Create SAS Datasets from RTF Files Honghua Chen, OCKHAM, Cary, NC

Keeping Track of Database Changes During Database Lock

A Macro to Create Program Inventory for Analysis Data Reviewer s Guide Xianhua (Allen) Zeng, PAREXEL International, Shanghai, China

Adjusting for daylight saving times. PhUSE Frankfurt, 06Nov2018, Paper CT14 Guido Wendland

Regaining Some Control Over ODS RTF Pagination When Using Proc Report Gary E. Moore, Moore Computing Services, Inc., Little Rock, Arkansas

PharmaSUG Paper PO12

PharmaSUG Paper TT11

PharmaSUG Paper AD06

Efficient Processing of Long Lists of Variable Names

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

Taming a Spreadsheet Importation Monster

An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step Mike Zdeb, FSL, University at Albany School of Public Health, Rensselaer, NY

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

A Few Quick and Efficient Ways to Compare Data

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

PharmaSUG China 2018 Paper AD-62

Run your reports through that last loop to standardize the presentation attributes

ABSTRACT INTRODUCTION WORK FLOW AND PROGRAM SETUP

3N Validation to Validate PROC COMPARE Output

PharmaSUG Paper SP04

EXAMPLES OF DATA LISTINGS AND CLINICAL SUMMARY TABLES USING PROC REPORT'S BATCH LANGUAGE

Macros for Two-Sample Hypothesis Tests Jinson J. Erinjeri, D.K. Shifflet and Associates Ltd., McLean, VA

A SAS Macro to Create Validation Summary of Dataset Report

Automate Clinical Trial Data Issue Checking and Tracking

Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL

What's the Difference? Using the PROC COMPARE to find out.

Give me EVERYTHING! A macro to combine the CONTENTS procedure output and formats. Lynn Mullins, PPD, Cincinnati, Ohio

Essential ODS Techniques for Creating Reports in PDF Patrick Thornton, SRI International, Menlo Park, CA

ABC Macro and Performance Chart with Benchmarks Annotation

PhUSE US Connect 2018 Paper CT06 A Macro Tool to Find and/or Split Variable Text String Greater Than 200 Characters for Regulatory Submission Datasets

SAS Macro Technique for Embedding and Using Metadata in Web Pages. DataCeutics, Inc., Pottstown, PA

Using a Control Dataset to Manage Production Compiled Macro Library Curtis E. Reid, Bureau of Labor Statistics, Washington, DC

Using MACRO and SAS/GRAPH to Efficiently Assess Distributions. Paul Walker, Capital One

22S:172. Duplicates. may need to check for either duplicate ID codes or duplicate observations duplicate observations should just be eliminated

... ) city (city, cntyid, area, pop,.. )

Ditch the Data Memo: Using Macro Variables and Outer Union Corresponding in PROC SQL to Create Data Set Summary Tables Andrea Shane MDRC, Oakland, CA

A Format to Make the _TYPE_ Field of PROC MEANS Easier to Interpret Matt Pettis, Thomson West, Eagan, MN

A Macro that can Search and Replace String in your SAS Programs

There s No Such Thing as Normal Clinical Trials Data, or Is There? Daphne Ewing, Octagon Research Solutions, Inc., Wayne, PA

MedDRA Dictionary: Reporting Version Updates Using SAS and Excel

Create Metadata Documentation using ExcelXP

Tracking Dataset Dependencies in Clinical Trials Reporting

SAS Macros for Grouping Count and Its Application to Enhance Your Reports

A Mass Symphony: Directing the Program Logs, Lists, and Outputs

Summary Table for Displaying Results of a Logistic Regression Analysis

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

Cleaning Duplicate Observations on a Chessboard of Missing Values Mayrita Vitvitska, ClinOps, LLC, San Francisco, CA

Using SAS software to fulfil an FDA request for database documentation

One Project, Two Teams: The Unblind Leading the Blind

Using SAS to Analyze CYP-C Data: Introduction to Procedures. Overview

Tweaking your tables: Suppressing superfluous subtotals in PROC TABULATE

ABSTRACT INTRODUCTION MACRO. Paper RF

Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico

22S:166. Checking Values of Numeric Variables

Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic

Matt Downs and Heidi Christ-Schmidt Statistics Collaborative, Inc., Washington, D.C.

Preserving your SAS Environment in a Non-Persistent World. A Detailed Guide to PROC PRESENV. Steven Gross, Wells Fargo, Irving, TX

Applications Development

Out of Control! A SAS Macro to Recalculate QC Statistics

Clinical Data Visualization using TIBCO Spotfire and SAS

Creating Macro Calls using Proc Freq

PharmaSUG Paper CC02

Let SAS Help You Easily Find and Access Your Folders and Files

SAS automation techniques specification driven programming for Lab CTC grade derivation and more

Using Proc Freq for Manageable Data Summarization

Open Problem for SUAVe User Group Meeting, November 26, 2013 (UVic)

Better Metadata Through SAS II: %SYSFUNC, PROC DATASETS, and Dictionary Tables

WEB MATERIAL. eappendix 1: SAS code for simulation

Tales from the Help Desk 6: Solutions to Common SAS Tasks

Transcription:

Paper number: PH03 Data Quality Review for Missing Values and Outliers Ying Guo, i3, Indianapolis, IN Bradford J. Danner, i3, Lincoln, NE ABSTRACT Before performing any analysis on a dataset, it is often useful to perform a review of the dataset in order to detect missing values and outliers. The macro provided in this paper is a SAS macro program designed to check the data in a time-efficient and user-friendly way, which can generate the following content: First, the macro can generate an Excel report to determine whether the variables have missing values or outliers, and report the percentage of missing values. Second, the macro can display all records which contain missing values and/or outliers, grouped by variables in an Excel file. Each tab will represent the exception records for one variable. This macro can automatically check all variables regardless of the dataset structure and variable names. Also, an extended version of this macro can perform an automatic check of a whole SAS data library at one time. THE PROBLEM Missing values and outliers create difficulties for most analyses. Without knowing the structure and details of missing data and outliers, problems may arise when analysis is performed, such as undetected outliers which may bias an analysis if these outliers are due to previously undetected errors. Therefore, when receiving datasets with dozens (even hundreds) of variables and thousands of observations, knowing the summary of missing values and outliers can be helpful to identify any upstream problems that may have crept into the data. Currently, a common method for this is to use PROC MEANS, PROC UNIVARIATE, and PROC FREQ for each variable to detect the missing values and outliers. If it is necessary, datasets are created to indicate which records have missing values and/or outliers. Performing checks for each variable, going through output for each procedure, and generating datasets for each variable which have missing values and/or outliers is quite time consuming, especially with large numbers of huge datasets. Individual variable names need to be typed in for each dataset to find missing values and/or outliers and output them to exception datasets. Additionally such programs are not reusable due to different structures and variable names for different studies. This can be very time-consuming. THE SOLUTION In order to expedite review of the datasets and detect missing values and outliers in a more time-efficient way, a SAS macro was developed. This macro will automatically process check each variable in a dataset for missing values and outliers. When missing values and/or outliers are found, the macro will automatically generate an exception dataset which will contain the records with missing values and/or outliers. The exception dataset will be output to an Excel file. Each tab will represent an exception data for one variable. There is no need to type in variable names for each dataset; all that is required is to specify the name and the location of the input dataset and the output Excel file. In this macro, an outlier was defined as a value outside the range for mean ± three standard deviations (SD). In order to eliminate the effect of outliers to the mean, a trimmed mean and SD was used, which cuts off certain percentage of both minimum and maximum data before calculating the mean and SD. The reason is that when there are not enough observations in a dataset, the outliers can significantly affect the mean and SD, thus keeping the outliers from being detected. Automatic checks can be performed on a whole data library without specifying each dataset in the library by name with the extended version of the macro. The results for each dataset are generated individually. THE OUTPUT OF THE MACRO First, the macro produces a summary output. The report will indicate if the variables have missing value or outliers. The percentage of missing value is also reported. Example 1: The dataset named CCC have five variables, name1, name2, number1, number2, number3, while name1 and name2 are character variables, and number1 number2 and number3 are numerical variables. 1

dataset name variable name type check check label TEST name1 2 n Non missing value found TEST name2 2 m missing value found 20 (20.202020202%) TEST number1 1 o outlier found TEST number2 1 m Missing value found 7 (7.07%) TEST number3 1 m Missing value found 7 (7.07%) TEST number3 1 o outlier found Secondly, the macro can generate all records which contain missing values (Example 2) and outliers (Example 3), grouped by variable in the Excel file. In the Excel file, each variable will create a new tab. For character variables, only missing values are reported. For numerical variables, both missing values and outliers are reported. Example 2: Missing values shown for name2, number2, and number3 variables. Example 2a represents all records with the variable name2 missing, Example 2b represents all records with the variable number2 missing, and Example 2c represents all records with the variable number3 missing. Example 2a Example 2b name1 name2 number1 number2 number3 bbb eee 80 26 ccc 92 62 ddd ccc 53 65 bbb 84 30 ddd 29 10 ggg bbb 21 72 ggg bbb 99 99 Example 2c name1 name2 number1 number2 number3 bbb ddd 14 65 ccc 91 73 bbb ccc 45 7 bbb ddd 86 15 bbb ddd 11 63 ddd 73 84 ggg bbb 71 52 Example 3: Outliers are displayed as follows: In Example 3a, the observations which have outliers in variable number1 are shown. In Example 3b, the observations which have outliers in variable number3 are shown. Example 3a name1 name2 number1 number2 number3 ccc 1000 75 63 bbb 590 62 40 Example 3b name1 name2 number1 number2 number3 fff ddd 55 28 3500 ddd ggg 76 87 1000 2

Example 4: In the Excel file, it also gives a summary. MEMNAME NAME TYPE LABEL check checkl TEST name1 2 name1 n Non missing value found TEST name2 2 name2 m missing value found 20 (20.20%) TEST number1 1 number1 o outlier found TEST number2 1 number2 m Missing value found 7 (7.07%) TEST number3 1 number3 m Missing value found 7 (7.07%) TEST number3 1 number3 o outlier found SUMMARY In summary, this macro is a very useful tool to check data efficiently. It can check all data values automatically, regardless of different structures and names for datasets; therefore, checks can be performed in a time-efficient manner. The macro can find potential data issues detected in input datasets, and report these issues properly to the data management team, enabling them to keep the data clean and correct. Finally, the macro can check multiple datasets from the same library. This special feature is especially helpful for people who work on data integration or anything dealing with multiple datasets at the same time. CONCLUSION This macro is simple and practical. It will be a very useful tool for people who want to review and understand data quickly and correctly. ACKNOWLEDGMENTS We would like to thank my collaborator Lixiang Liu in InVentiv Clinical solutions who validated this macro. His comments are really appreciated. CONTACT INFORMATION Your comments and questions are valued and encouraged. Please contact the primary author at: Ying Guo I3 Indianapolis, IN 46285 Email: evelyn.guo@i3statprobe.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. 3

APPENDIX /******User guide****** xlsfile: Input the path and excel file name which the missing and outlier is output to. indata: Input the data name which you want to check. outfile: Input the path and file name for rtf report. trim: Input the percentage you want to trim when calculate mean and SD. ***********************************************************/; /********** Check for one dataset**********/ ods listing close; %macro onedata (xlsfile=, indata=, outfile=, trim=); /*Create content table*/ proc contents data=&indata out=tmp(keep=memname name type label); /*Check frequency for Character variable, mean for numerical variable*/ /*Get the number of character and numerical variable*/ proc freq data=tmp; table type; ods output Freq.Table1.OneWayFreqs=tmp1(keep=type frequency); data _null_; set tmp1; call symput('type' '_' trim(left(type)),frequency); %put &type_1 &type_2; /*Check frequency for Character variable*/ data tmpc; set tmp; if type=2; data _null_; set tmpc; call symput('char' '_' trim(left(_n_)),trim(left(name))); %let i=1; %macro loopcha; %do i=1 %to &type_2; proc freq data=&indata; table &&char_&i/missing; ods output onewayfreqs=&&char_&i(keep=&&char_&i frequency percent); data &&char_&i; set &&char_&i; if &&char_&i=' ' and frequency ne 0 then do; name="&&char_&i"; check='m'; checkl="missing value found" " " trim(left(frequency)) " (" trim(left(percent)) "%)"; 4

proc sort data=&&char_&i nodupkey; by name check; data tmpc; merge tmpc(in=a) &&char_&i(keep=name check checkl); by name; if a; if check = ' ' then do; check='n'; checkl="non missing value found"; %mend loopcha; %loopcha /*output missing values for character variable*/ libname lib "&xlsfile"; data tmpcm; set tmpc; where check='m'; call symput('varc' '_' trim(left(_n_)),trim(left(name))); proc sql noprint; select count(*) into: n from tmpcm; %macro missingvalue; %do i=1 %to &n; data lib.&&varc_&i; set &indata; where &&varc_&i = ' '; %mend missingvalue; %missingvalue; Libname lib clear; /*Check outliner and missing value for numerical variable*/ data tmpn; set tmp; if type=1; data _null_; set tmpn; call symput('num' '_' trim(left(_n_)),trim(left(name))); %let i=1; %macro loopnum; %do i=1 %to &type_1; proc means data=&indata n min max nmiss; var &&num_&i; ods output means.summary=temp; 5

proc univariate data=&indata trimmed=&trim; var &&num_&i; ods output TrimmedMeans=&&num_&i; data &&num_&i; merge &&num_&i temp; data &&num_&i; set &&num_&i; per=&&num_&i.._nmiss/(&&num_&i.._n+&&num_&i.._nmiss); if &&num_&i.._nmiss ne 0 then do; name="&&num_&i"; check='m'; checkl="missing value found" " " trim(left(&&num_&i.._nmiss)) " (" trim(left(round(per, 0.0001)*100)) "%)"; if. < &&num_&i.._min < (mean - 3*stdmean*sqrt(&&num_&i.._n-1)) or &&num_&i.._max > (mean + 3*stdmean*sqrt(&&num_&i.._n-1)) then do; name="&&num_&i"; check='o'; checkl="outlier found"; else proc sort data=&&num_&i nodupkey; by name checkl; data tmpn; merge tmpn(in=a) &&num_&i(keep=name check checkl); by name; if a; if check = ' ' then do; check='n'; checkl="non missing or outlier found"; %mend loopnum; %loopnum /*output missing values for numerical variable*/ libname lib "&xlsfile"; data tmpnm; set tmpn; where check='m'; call symput('varnm' '_' trim(left(_n_)),trim(left(name))); proc sql noprint; select count(*) into: n from tmpnm; %macro missingvalue; %do i=1 %to &n; data lib.m_&&varnm_&i; 6

set &indata; where &&varnm_&i =.; %mend missingvalue; %missingvalue; Libname lib clear; /*output outlier values for numerical variable*/ libname lib "&xlsfile"; data tmpno; set tmpn; where check='o'; call symput('varno' '_' trim(left(_n_)),trim(left(name))); proc sql noprint; select count(*) into: n from tmpno; data &indata; set &indata; id=1; %macro outlier; %do i=1 %to &n; ods listing close; proc univariate data=&indata trimmed=&trim; var &&varno_&i; ods output TrimmedMeans=&&varno_&i (keep=mean stdmean); proc means data=&indata n ; var &&varno_&i; ods output means.summary=temp(rename=(&&varno_&i.._n=n)); ods listing; data &&varno_&i; merge &&varno_&i temp; data &&varno_&i; set &&varno_&i; id=1; data &indata; merge &indata &&varno_&i; by id; data lib.o_&&varno_&i(drop=mean stdmean n id); set &indata; if. < &&varno_&i < (mean - 3*stdmean*sqrt(n-1)) or &&varno_&i > (mean + 3*stdmean*sqrt(n-1)) then data &indata; set &indata (drop=mean stdmean n); 7

%mend outlier; %outlier ods listing; data final; merge tmp tmpc tmpn; by memname name; data lib.summary; set final; Libname lib clear; filename outfile "&outfile"; ods rtf file=outfile; proc report data=final nowindows headline headskip; column memname name type check checkl; define memname / width=15 left "dataset name" flow id; define name / width=10 left "variable name"; define type / width=6 left "type" ; define check / width=6 left "check"; define checkl / width=30 left "check label"; compute after ; ; endcomp; ods rtf close; %mend onedata; /*Example macro call*/; /*%onedata(xlsfile=h:\lixliu\uat\test1.xls, indata=aaa, outfile='h:\lixliu\uat\test2.rtf', trim=0.2)*/ *****************Check for whole library*************; /*Here is an example, libname was define on my local computer.*/; libname gwcv 'Y:\BYETTA._G\GWCV\GLS'; ods output members=gwcvmem; proc datasets mt=data library=gwcv; proc copy out=work in=gwcv; proc sql noprint; select count(*) into:num from gwcvmem; %put &num; data _null_; set gwcvmem; call symput('name' '_' trim(left(_n_)),trim(left(name))); %let j=1; %macro loop; %do j=1 %to &num; 8

%onedata(indata=&&name_&j, outfile=&&name_&j..final.rtf trim=0.2) %mend loop; %loop 9