
Paper 7720-2016

Omitting Records with Invalid Default Values

Lily Yu, Statistics Collaborative, Inc.

ABSTRACT

Many databases include default values that are set inappropriately. These default values may be scattered throughout the database, with no pattern to their occurrence. Records without valid information should be omitted from statistical analysis, yet choosing the best method to omit them can be difficult, and manual corrections are often prohibitively time-consuming. SAS offers useful and efficient approaches: a DATA step combined with either the SQL or the SORT procedure. This paper provides a step-by-step tutorial for accomplishing this task.

INTRODUCTION

There are times when we need to delete records based on missing values in a dataset. For example, a top-level Yes/No question may be followed by a group of check boxes to which a respondent may give multiple answers. When creating the database, programmers may choose to set a default value of 0 for each of the check boxes under the top-level question. These default values must be interpreted with caution: when the top-level question is missing, the values of the sub-group questions should also be treated as missing, even though the sub-group questions show values of 0 from the default setting. A similar scenario occurs with a Yes/No or Checked/Not Checked question for which "No" or "Not Checked" is the default when the question is not answered (the value actually should be missing). Programmers should understand the difference between missing values and invalid values (that is, incorrectly set defaults), because such records must be handled differently. Records should clearly be omitted when the values of all their variables are missing, but records may also need to be deleted when some variables carry values that are meaningless.

There are many ways to select and delete records. The task is much easier when the deletion is based on particular variables. For example, to delete HIV-positive males, or women aged 18 or younger, from a demographic dataset, we could use conditional IF statements:

if upcase(gender)="MALE" and upcase(hiv)="POSITIVE" then delete ;
if age <= 18 and upcase(gender)="FEMALE" then delete ;
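For completeness, here is how the two IF statements above might sit inside a full DATA step. This is a minimal sketch; the input dataset DEMOG and its variables GENDER, HIV, and AGE are hypothetical names used only for illustration:

** Delete unwanted records by explicit conditions ;
** (DEMOG and its variables are hypothetical) ;
data demog_clean ;
   set demog ;
   if upcase(gender)="MALE" and upcase(hiv)="POSITIVE" then delete ;
   if age <= 18 and upcase(gender)="FEMALE" then delete ;
run ;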

In most instances, we use conditional statements like these to delete unwanted records. When the full set of variables must serve as the condition for deletion, programming becomes more challenging, especially in a large database with hundreds of variables. These situations are also hard to detect because not all of the values are missing: some variables hold values that should be treated as missing, since they provide no valid information and can even dilute or mislead the results. Although these values are not missing as displayed in the database, they should be considered missing because they were pre-set as defaults when the database was built.

Database designers set up defaults to simplify data collection: data entry should take as little time as possible, and the collected data should be as accurate as possible. The hope is that default values will speed up the data entry process. Another common practice is to carry key variable values over from screen to screen, or from page to page. For example, subject ID information can be carried over from visit to visit, and for a database with multiple records per visit, the visit number and date can be carried over as well.

This approach can be beneficial: records are entered faster, and consistency and accuracy improve when moving from one record to the next. On the other hand, it makes erroneous records harder to detect. Data entry staff sometimes accidentally hit the "next record" button when there is no data to enter for that record and forget to delete the resulting data entry page later. These invalid records are kept alongside the valid records, leaving more work for the programmers, who must know how to separate invalid records from valid ones and how to delete the invalid data. We see more and more databases set up this way, as developers rely on programmers to omit such records programmatically during the data cleaning and analysis phases.

For illustration, a partial CRF page (PE, Physical Examination) is shown below:

   Subject Name: ____   Visit: ___   Date: ____   Age: ___
   Gender:   1. Male   2. Female
   Is Physical Examination done?:   1. Yes   0. No
   If Yes, any significant abnormality:
      Body System              Abnormality
      1.  Immunologic/Allergy  1. Yes   0. No
      2.  HEENT                1. Yes   0. No
      3.  Respiratory          1. Yes   0. No
      4.  Cardiovascular       1. Yes   0. No
      5.  Gastrointestinal     1. Yes   0. No
      6.  Urogenital           1. Yes   0. No
      7.  Musculoskeletal      1. Yes   0. No
      ...
      12. Other                1. Yes   0. No

Based on this CRF, the example dataset used in this paper is as follows:

   Obs  subj_id  visit  age  gender  QQ  Q1  Q2  Q3  Q4  Q5
    1   Chris      1     20    M      1   1   1   1   0   0
    2   Elaine     1     36    F      0   0   0   0   0   0
    3   John       1      .           .   0   0   0   0   0
    4   Mary       1      .    F      1   1   0   1   0   0
    5   Mary       2      .           .   0   0   0   0   0
    6   Paul       1     18    M      1   0   1   0   1   0
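Rows 3 and 5 (John's visit 1 and Mary's visit 2) are the invalid records: the examination question QQ is missing, yet Q1 through Q5 show the default 0. One could size the problem by hard-coding the all-default pattern, which is exactly the kind of variable-by-variable condition that becomes impractical with hundreds of variables. A minimal sketch, runnable once the PE dataset is created in the next section:

proc sql ;
   select count(*) as n_invalid
   from PE
   where age is missing and gender is missing and QQ is missing
     and Q1=0 and Q2=0 and Q3=0 and Q4=0 and Q5=0 ;
quit ;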

We have three tasks:

1. Identify a record that needs to be omitted and populate it for all records.
2. Create a temporary dataset to be used as a comparison dataset.
3. Delete the identified records and create a new dataset with the invalid records omitted.

The following sections go through these steps one by one.

** Create a PE dummy dataset ;
data PE ;
   infile datalines dsd delimiter=',' ;
   input subj_id $ visit age gender $ QQ Q1 Q2 Q3 Q4 Q5 ;
   datalines;
Chris, 1, 20, M, 1, 1, 1, 1, 0, 0
Elaine, 1, 36, F, 0, 0, 0, 0, 0, 0
John, 1,,,, 0, 0, 0, 0, 0
Mary, 1,, F, 1, 1, 0, 1, 0, 0
Mary, 2,,,, 0, 0, 0, 0, 0
Paul, 1, 18, M, 1, 0, 1, 0, 1, 0
;

(The DSD option is needed so that consecutive commas are read as missing values rather than collapsed into a single delimiter.)

1. IDENTIFY A RECORD THAT NEEDS TO BE OMITTED AND POPULATE IT FOR ALL RECORDS

** Get template from a blank record ;
data DUMMY ;
   set PE (where=(subj_id='John')) nobs=last_obs ;
   do i=1 to last_obs ;
      output ;
   end ;
   drop i SUBJ_ID VISIT ;
run ;

We are not only picking the invalid record; we also replicate it once for each record in the PE dataset so that the records can later be compared one by one. The NOBS= option on the SET statement retrieves the total number of observations, which serves as the end point of the DO loop. We also need to drop the identifying variables of the template record (John's SUBJ_ID and VISIT in this case) so that the replicated records are generic when merged with the dataset we want to check, that is, PE. Running this data step creates the dataset DUMMY.
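As a quick check, printing DUMMY should show six identical rows, with AGE and QQ missing, GENDER blank, and Q1 through Q5 all 0; a minimal sketch:

proc print data=DUMMY ;
   title "DUMMY dataset (replicated template record)" ;
run ;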

2. CREATE A TEMPORARY DATASET TO BE USED AS A COMPARISON DATASET

Here we create a temporary dataset (TEMP) by merging the DUMMY dataset with the PE dataset, bringing in only the key variables from PE (Subject ID and Visit in this example, although the method is not limited to two key variables). This makes TEMP a comparable and complete dataset: it has the same number of subjects, the same number of records, and the same number and names of variables (the key variables among them) as the PE dataset. The only difference between the two datasets is their contents. The PE dataset holds the original information, valid and invalid; the TEMP dataset contains one record for every record in PE, each populated with the invalid template record.

data TEMP ;
   merge PE (keep=SUBJ_ID VISIT) DUMMY ;
run ;

Merging two datasets without a BY statement is generally not advisable, because information risks being combined incorrectly. In this case, as the previous data steps show, the TEMP dataset mimics the number of records and variables of the PE dataset while replacing the values of the non-key variables with those in DUMMY, so the merge is not problematic. It performs a one-to-one merge, combining the two datasets horizontally. Since the two datasets share no common variables, a BY statement does not apply; observations are combined based on their relative position in each dataset.

proc print data=TEMP ;
   title "TEMP dataset" ;
run ;

We have now created a dataset that holds one invalid record for each record in the PE dataset. We will use it in the next step when comparing the two datasets.

3. DELETE THE IDENTIFIED RECORDS AND CREATE A NEW DATASET WITH THE INVALID RECORDS OMITTED

** Approach 1: Using the SQL procedure ;
proc sql ;
   create table PE_FINAL as
   select * from PE
   except
   select * from TEMP ;
quit ;

proc print data=PE_FINAL ;
   title "Approach 1 - Using SQL procedure" ;
run ;

The PE_FINAL output:

   Obs  subj_id  visit  age  gender  QQ  Q1  Q2  Q3  Q4  Q5
    1   Chris      1     20    M      1   1   1   1   0   0
    2   Elaine     1     36    F      0   0   0   0   0   0
    3   Mary       1      .    F      1   1   0   1   0   0
    4   Paul       1     18    M      1   0   1   0   1   0

With SQL, the deletion is easily done. We create a new dataset (PE_FINAL) without the missing/invalid records by comparing the PE dataset with the TEMP dataset: a record in PE is deleted if it also exists in TEMP and is kept otherwise. The EXCEPT set operator in the SQL procedure is the key to accomplishing this task.
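One caveat is worth noting: EXCEPT compares entire rows and, by default, also removes duplicate rows within the first query's result. If the source data could legitimately contain identical duplicate records that must be kept, the ALL keyword preserves them; a minimal sketch of that variant:

proc sql ;
   create table PE_FINAL as
   select * from PE
   except all
   select * from TEMP ;
quit ;

For the example data the result is the same, since PE contains no duplicate rows.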

** Approach 2: Using the SORT procedure ;

** Before using the SORT procedure, combine the two datasets ;
data PE0 ;
   set PE TEMP ;
run ;

To use the SORT procedure to delete the unwanted records, we first combine the PE and TEMP datasets vertically, so the combined dataset contains twice as many records. A data step with a SET statement stacks the two datasets. Please note that the order of the two datasets matters: the TEMP dataset, which contains the invalid records, must be listed second in the SET statement. You will see why in the second SORT procedure. After combining, two SORT procedures are applied.

1. The first SORT procedure produces the duplicate-record dataset (PE1) and the unique-record dataset (PE2):

proc sort data=PE0 out=PE1 nouniquerec uniqueout=PE2 ;
   by SUBJ_ID VISIT ;
run ;

The UNIQUEOUT= option must be paired with the NOUNIQUEREC option to obtain the unique-record dataset PE2; both must be specified in the PROC SORT statement. What happens if the NOUNIQUEREC option is omitted? A warning is issued in the log:

WARNING: None of the following options have been specified: NODUPKEY, NODUPREC, NOUNIQUEKEY, or NOUNIQUEREC. The DUPOUT= or UNIQUEOUT= data set will contain zero observations.

With the paired options specified, the first SORT procedure outputs PE2:

   Obs  subj_id  visit  age  gender  QQ  Q1  Q2  Q3  Q4  Q5
    1   Chris      1     20    M      1   1   1   1   0   0
    2   Chris      1      .           .   0   0   0   0   0
    3   Elaine     1     36    F      0   0   0   0   0   0
    4   Elaine     1      .           .   0   0   0   0   0
    5   Mary       1      .    F      1   1   0   1   0   0
    6   Mary       1      .           .   0   0   0   0   0
    7   Paul       1     18    M      1   0   1   0   1   0
    8   Paul       1      .           .   0   0   0   0   0

The fully identical pairs (the invalid PE records and their TEMP copies, for John's visit 1 and Mary's visit 2) have been routed to PE1, but examining PE2 shows this is still not what we want: each valid record is accompanied by its template counterpart. We therefore use the SORT procedure a second time to delete the invalid records.

2. The second SORT procedure deletes the missing/invalid records remaining in PE2:

proc sort data=PE2 out=PE_FINAL nodupkey ;
   by SUBJ_ID VISIT ;
run ;
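The final dataset can then be displayed just as in Approach 1; a minimal sketch:

** The final dataset without invalid records ;
proc print data=PE_FINAL ;
   title "Approach 2 - Using SORT procedure" ;
run ;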

Most SAS programmers have used the NODUPKEY option: when more than one record shares the same values of the specified key variables (SUBJ_ID and VISIT), it keeps only the first record for each key. In this case, the invalid records are deleted because they always sort second within each key: recall that we placed the TEMP dataset second in the SET statement when stacking it with the PE dataset, and PROC SORT's default EQUALS behavior preserves the relative order of records with identical key values.

Approach 2 output from the second SORT procedure:

   Obs  subj_id  visit  age  gender  QQ  Q1  Q2  Q3  Q4  Q5
    1   Chris      1     20    M      1   1   1   1   0   0
    2   Elaine     1     36    F      0   0   0   0   0   0
    3   Mary       1      .    F      1   1   0   1   0   0
    4   Paul       1     18    M      1   0   1   0   1   0

CONCLUSION

Of the two approaches illustrated above, the first, using the SQL procedure, is preferable: it involves fewer procedure steps and is therefore more efficient. That the same result can be achieved by multiple methods is one of the beauties of SAS. The example dataset shown here is small; the efficiency benefits grow with larger datasets. The program described above has been used successfully on datasets with thousands of records and hundreds of variables.

ACKNOWLEDGMENTS

The author would like to extend her appreciation to Statistics Collaborative, Inc. for supporting her participation in SAS Global Forum 2016, and to her colleagues Janet Wittes, Janelle Rhorer, Lisa Weissfeld, and Elizabeth Toretsky for reviewing the paper.

CONTACT INFORMATION

Your comments are valuable and encouraged. Please feel free to contact the author about the paper:

Lily Yu
Sr. Statistical Programmer
Statistics Collaborative, Inc.
1625 Massachusetts Ave. NW, Suite 600
Washington, DC 20036
202-247-9996
Lily.Yu@statcollab.com
www.statcollab.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.