Keh-Dong Shiang, Department of Biostatistics & Department of Diabetes, City of Hope National Medical Center, Duarte, CA

Similar documents
Effectively Utilizing Loops and Arrays in the DATA Step

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

Standard Safety Visualization Set-up Using Spotfire

Simplifying Your %DO Loop with CALL EXECUTE Arthur Li, City of Hope National Medical Center, Duarte, CA

The Power of Combining Data with the PROC SQL

Clinical Data Visualization using TIBCO Spotfire and SAS

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

An Automation Procedure for Oracle Data Extraction and Insertion

Extending the Scope of Custom Transformations

Get into the Groove with %SYSFUNC: Generalizing SAS Macros with Conditionally Executed Code

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

OLAP Drill-through Table Considerations

Remove this where. statement to produce the. report on the right with all 4 regions. Retain this where. statement to produce the

Advanced Visualization using TIBCO Spotfire and SAS

Automate Clinical Trial Data Issue Checking and Tracking

Essentials of PDV: Directing the Aim to Understanding the DATA Step! Arthur Xuejun Li, City of Hope National Medical Center, Duarte, CA

Programmatic Automation of Categorizing and Listing Specific Clinical Terms

Quick and Efficient Way to Check the Transferred Data Divyaja Padamati, Eliassen Group Inc., North Carolina.

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Lecture 1 Getting Started with SAS

An Approach Finding the Right Tolerance Level for Clinical Data Acceptance

A Simple Interface for defining, programming and managing SAS edit checks

Mapping Clinical Data to a Standard Structure: A Table Driven Approach

STEP Household Questionnaire. Guidelines for Data Processing

The first approach to accessing pathology diagnostic information is the use of synoptic worksheets produced by the

PharmaSUG Paper TT11

ZIPpy Safe Harbor De-Identification Macros

Matt Downs and Heidi Christ-Schmidt Statistics Collaborative, Inc., Washington, D.C.

Application of Modular Programming in Clinical Trial Environment Mirjana Stojanovic, CALGB - Statistical Center, DUMC, Durham, NC

Beginning Tutorials. PROC FSEDIT NEW=newfilename LIKE=oldfilename; Fig. 4 - Specifying a WHERE Clause in FSEDIT. Data Editing

A SAS/AF Application for Linking Demographic & Laboratory Data For Participants in Clinical & Epidemiologic Research Studies

USING HASH TABLES FOR AE SEARCH STRATEGIES Vinodita Bongarala, Liz Thomas Seattle Genetics, Inc., Bothell, WA

Validation Summary using SYSINFO

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95

Cite: CTSA NIH Grant UL1- RR024982

SAS CLINICAL SYLLABUS. DURATION: - 60 Hours

Clinical Optimization

An Efficient Tool for Clinical Data Check

Workshop MedSciNet - Building an Electronic Data Capture System for a Clinical Trial

Table of Contents. The RETAIN Statement. The LAG and DIF Functions. FIRST. and LAST. Temporary Variables. List of Programs.

Applied Information and Communication Technology

Cover the Basics, Tool for structuring data checking with SAS Ole Zester, Novo Nordisk, Denmark

Journey to the center of the earth Deep understanding of SAS language processing mechanism Di Chen, SAS Beijing R&D, Beijing, China

Understanding and Applying the Logic of the DOW-Loop

Hypothesis Testing: An SQL Analogy

There s No Such Thing as Normal Clinical Trials Data, or Is There? Daphne Ewing, Octagon Research Solutions, Inc., Wayne, PA

Import and Browse. Review data. bp_stages is a chart based on a graphic

Taming a Spreadsheet Importation Monster

Psoriasis Registry Graz Austria. User Manual. Version 1.0 Page 1 of 19

SAS Programming Techniques for Manipulating Metadata on the Database Level Chris Speck, PAREXEL International, Durham, NC

Real Time Clinical Trial Oversight with SAS

PharmaSUG Paper CC11

OnCore Enterprise Research. Subject Administration Full Study

JMP Clinical. Release Notes. Version 5.0

ABSTRACT INTRODUCTION THE GENERAL FORM AND SIMPLE CODE

Integrated Safety Reporting Anemone Thalmann elba - GEIGY Ltd (PH3.25), Basel

CHAPTER 7 Using Other SAS Software Products

Clinical Optimization

Paper A Simplified and Efficient Way to Map Variable Attributes of a Clinical Data Warehouse

Ditch the Data Memo: Using Macro Variables and Outer Union Corresponding in PROC SQL to Create Data Set Summary Tables Andrea Shane MDRC, Oakland, CA

Data Manipulation with SQL Mara Werner, HHS/OIG, Chicago, IL

Smart Measurement System for Late Phase

Unique Technology for the Future

Let Hash SUMINC Count For You Joseph Hinson, Accenture Life Sciences, Berwyn, PA, USA

PharmaSUG China. Systematically Reordering Axis Major Tick Values in SAS Graph Brian Shen, PPDI, ShangHai

Introduction. Understanding SAS/ACCESS Descriptor Files. CHAPTER 3 Defining SAS/ACCESS Descriptor Files

Let SAS Write and Execute Your Data-Driven SAS Code

How to clean up dirty data in Patient reported outcomes

Parallelizing Windows Operating System Services Job Flows

Paper Haven't I Seen You Before? An Application of DATA Step HASH for Efficient Complex Event Associations. John Schmitz, Luminare Data LLC

Preparing the Office of Scientific Investigations (OSI) Requests for Submissions to FDA

Beginning Tutorials. Introduction to SAS/FSP in Version 8 Terry Fain, RAND, Santa Monica, California Cyndie Gareleck, RAND, Santa Monica, California

Get Started Writing SAS Macros Luisa Hartman, Jane Liao, Merck Sharp & Dohme Corp.

CPRD Aurum Frequently asked questions (FAQs)

Customized Flowcharts Using SAS Annotation Abhinav Srivastva, PaxVax Inc., Redwood City, CA

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

SYSTEM 2000 Essentials

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

SOPHISTICATED DATA LINKAGE USING SAS

Applied Information and Communication Technology

EGCI 321: Database Systems. Dr. Tanasanee Phienthrakul

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

SAS and the Intranet: How a Comments System Creates In-House Efficiencies Morris Lee, Gilead Sciences, Inc., Foster City, California

DBLOAD Procedure Reference

Techdata Solution. SAS Analytics (Clinical/Finance/Banking)

Write SAS Code to Generate Another SAS Program A Dynamic Way to Get Your Data into SAS

How to write ADaM specifications like a ninja.

PharmaSUG Paper PO12

ABSTRACT INTRODUCTION WORK FLOW AND PROGRAM SETUP

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

Vine Medical Group Patient Registration Form Your Information

Example Candidate Responses. Cambridge International AS & A Level Computer Science. Paper 2

Data Science Centre (DSC) Data Preparation Policy: Guidelines for managing and preparing your data for statistical analysis

Simplifying Effective Data Transformation Via PROC TRANSPOSE

Omitting Records with Invalid Default Values

ABSTRACT DATA CLARIFCIATION FORM TRACKING ORACLE TABLE INTRODUCTION REVIEW QUALITY CHECKS

Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?

One Project, Two Teams: The Unblind Leading the Blind

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University

Transcription:

Validating Data Via PROC SQL Keh-Dong Shiang, Department of Biostatistics & Department of Diabetes, City of Hope National Medical Center, Duarte, CA ABSTRACT The Structured Query Language (SQL) is a standardized language that can be efficiently used to retrieve and update data stored in relational databases. Interestingly, SQL s WHERE-clause rowselection feature can be particularly utilized and applied for data validation and error checks. In this paper, I will present a method of designing and constructing a highly structured validation program, which is applied to implement a data quality assurance procedure for collecting clean clinical data. With both SAS macros and SQL procedures used, the results will demonstrate the following benefits of this validation program: easy maintenance with no hard-coded checks, transparency of error-check edits to the users, clear and easily-understood reporting system, and considerable reduction of the total amount of programming time. WHAT IS DATA VALIDATION? Data validation is a process that helps us qualitatively control our data. It mostly involves the identification of faulty records in your database. Validation can establish documented evidence providing a high degree of assurance that a specific computerized process will consistently produce good quality (i.e., accuracy, consistency, completeness, integrity and precision) resultant data sets meeting your predetermined specifications. WHY VALIDATION? Here is an example of the problem data set in database: Gender Height DOB F 158 12/24/1967 M 175 10/29/2975 M 6.2 7/12/1983 F 164 167 8/12/1974 It s easy to see that two faulty records are presented: an incorrect input of DOB in row 2 and the wrong Height unit applied in row 3. In addition, some of the required values like Gender and DOB are left blank. However, these incorrect and faulty data should be identified and corrected or even removed from the patient record, so that they do not affect the management s decision making while analyzing and reporting the resulting data. Decision-support system that operates on incorrect and faulty data will create incorrect statistical results and generate faulty advice. Therefore, the

implementation of a good validation system in the database is necessary and essential to ensure that the data records are clean, accurate and complete. With the proper validation system implemented, you can: Make sure that the raw data are accurately entered into a computer readable data set. Check that character variables contain only valid values, such as M and F for gender field. Check that numeric variables are within predefined ranges. This will ensure that data outsides of the permitted range are not allowed; for example an age of 150 and the systolic blood pressure is not between 80 and 200. Examine if there are missing values for variables where complete data is necessary. Check for and eliminate duplicate data entries. For instance, two records with the same patient ID and the date of hospital visit might constitute an error. Overlook the uniqueness of certain values, such as patient ID number. Check for invalid date values, such as 12/32/2002, 14/07/1999 or 11/25/9999. Define logic checks and cross-field checks. For instance, if the question Is the patient currently taking anti-hyperglycemic medications? was answered with Yes, then the following fields of agent, dose, units, and frequency for all medications used should not be left blank, and vice versa. Verify that more complex multi-file rules, such as cross-form check or consistency check, have been followed. For example, if a serious adverse event (SAE) with an assigned toxicity code occurs in one data set (e.g. toxicity form), an observation with the same toxicity code will be expected in another data set (e.g. SAE report). In addition, the date of this event observation must be after the start date and before the end date of the study trial or follow-up time period. Furthermore, it is definite that a patient having a hysterectomy or an ovarian cancer could not be recorded as M or male. DESIGN OF DATA VALIDATION I have developed 2 validation programs: one was coded in SAS environment (see APPENDIX), and the other one was coded with Microsoft Access Visual Basic for Application (VBA). These two codes have the same program schema and program flow, but a different syntax and statement. In this Data Validation section, the same programming schema is simply illustrated below: 1. CREATE A MASTER TABLE First, create a table or data set called Master containing all the study SQL tables (such as PhysicianReport form, VitalSigns form, ) that you want to include in the validation checks. The following is what the data set looks like: MainTableName SQLTableName STableName dbo_vcrossmatch dbo_vcrossmatch svcrossmatch dbo_veligibilitychklist dbo_veligibilitychklist sveligibilitychklist dbo_vfollowup dbo_vfollowup svfollowup dbo_vphysicalexam dbo_vphysicalexam svphysicalexam dbo_vphysicianreportmaster dbo_vphysicianreportillness svphysicianreportillness dbo_vphysicianreportmaster dbo_vphysicianreportmaster svphysicianreportmaster dbo_vphysicianreportmaster dbo_vphysicianreportmeds svphysicianreportmeds

2. ADD VALIDATION RULES Create programming specifications that take the requirements (e.g. patient s age at the time of the transplant = age difference between TransplantDate and BirthDate) and translate them into programming and database terminology (i.e. AgeAtTransplant = INT((TransplantDate- BirthDate)/365.25))). The calculated Age field is derived from the DOB (Date of Birth) field. Similarly, this English-Programming language translation can be applied to AgeAtDiagnosis, AgeAtOnstudy and all other fields on study forms. After completing the specification table, these requirements should be exposed to and reviewed by the users (in our case, they are study protocol coordinators or clinical research associates - CRAs), to verify whether the requirements are interpreted correctly or not. This process can limit or avoid the that is NOT what I need misinterpretation syndrome, where the users see the misinterpreted final output that does NOT quite address their interest and needs. Corresponding to each row in the above Master table, a S-table with SQL Where Condition and Error Message is also created by inputting the logical or conditional row based on the criteria study protocol coordinator or CRA defines and protocol rules. The sample S-table may look like this: TableName FieldName WhereCond ErrorMsg InitialInfo TypeIDiabYN <> 1 Question type 1 diabetes was answered "No" InitialInfo DOB Is Null DOB field should not be blank InitialInfo HtFt <> Null AND HtFt Not Between 3 and 7 HeightFeet value should be 3~7 InitialInfo HomePhone <> Null AND LEN(HomePhone) <> 12 Incorrect HomePhone value length InitialInfo Gender <> Null AND Gender Not IN('M', 'F') Gender value was not 'M' or 'F' PhysicianReport Cpeptide <> Null AND CPeptide > 0.2 Fasting C-Peptide must be less than or equal to 0.2 3. PROGRAM LOGIC There are total 3 nested iterative loops in the program, which are presented as follows. DO (looping through each row in Master table) Statements Choose a SQL table name from Master table END DO DO (looping through each row in S-table) Statements Select a row (by row) with SQL statement from S-table END DO DO (looping through each row in target table) TESTS SQL table one record at a time Execute PROC SQL with WHERE Condition in S-table If returns 1, inserting errors into Errors table. If returns 0, it has passed the validation check. END DO In the innermost loop, the SQL statement is composed of 3 fields (i.e., in square brackets) from the S-table, which may look like:

SELECT COUNT(*) FROM [TableName] WHERE [FieldName] [WhereCond] This standard SELECT SQL statement will be executed one row at a time in this innermost loop. 4. RUN CODE AND OUTPUT RESULTS When SQL query returns value 1, it means that the tested row fails to pass the validation check. Thus, an error record will be appended into the Errors table. An example of this output table is shown below: ID TableName FieldName ErrorMsg 28530 InitialInfo HomePhone Incorrect HomePhone value length 28530 InitialInfo E_Mail E-mail address should include '@' character 28530 InitialInfo SSN SSN value's length is incorrect 1061 InitialInfo HtFt HeightFeet value was left blank 92819 PhysicianReport Cpeptide Fasting C-Peptide must be less than or equal to 0.2 9053 PhysicianReport Albumin Albumin value should be greater than 3.5 g/dl 62683 InitialInfo TypeIDiabYN Question type 1 diabetes was answered "No" 5. ERROR TRAPPING These invalid data entries, failure to pass the validation check, are rejected with clear-predefined and user-friendly error messages (as shown at the last column in the above table) and also trigger a subroutine to update a Flag variable s value in SQL table to a negative number, such as 1. If data rows pass the validation check, the FLAG value will become a positive number, such as 1. Thus, you will clearly see the pass/fail identifiers for all the records in tested table. Also, you can easily extract the good records by selecting only the Flag=1 rows. 6. DISTRIBUTING AND REVIEWING ERROR REPORTS The program developer and database administrator should check the correctness of computed results before sending out the error report. An easily understood and clearly organized error report can be routed back to the persons who collected and/or input the data. They may refer to the original form information to validate the computerized data. PROGRAM DEVELOPMENT LIFE CYCLE (PDLC) The development of this validation program can be broken down into six steps: 1. Gather program requirements via interaction between users and programmer. 2. Create programming criteria specifications and validate the SQL query statements. 3. Program development and testing. 4. Executing the validation programs and printing error reports. 5. User review with acceptance or rejection comments. 6. Program integration testing and reporting. Therefore, all these validation processes can be summarized with a simple formula:

Validation = Specification + Programming + Testing + Reporting + Reviewing + Documentation ADVANTAGES The benefits of this validation program: Programs are in a highly structured and systematic manner. Compared with the conventional procedure programming, this program can reduce the total amount of programming time needed. For instance, if users want to add more criteria to validate data, database administrator or programmer just needs to add logical or conditional rows in S-table. The validation program can leave intact, and the only thing we have to do is to fill in all the specific requirements in the S-table. Easy maintenance, just updating the S-table (creating SQL specifications that take the requirements from users and translate them into database terminology and SQL syntax) to maintain the high quality data. Automate data-reporting routine tasks, generating periodic error reports for users to review the faulty records. Flag values added in tested tables could play an important role when extracting data for final statistical analysis. Assists code development more efficiently that program has been designed, from the beginning to the end, to produce the correct and desired results. Reusable validated code. This code block can be repeatedly applied as the basic check and used by all general cases (in our case, used by a majority of studies, such as patient profile, demographic information, vital signs, concurrent medications, safety measurement for adverse events and laboratory data). CONCLUSION This presentation has demonstrated a method of designing and constructing a highly structured and clearly-defined validation program. The validation programs I developed can be applied to implementing a data quality assurance procedure for collecting clean clinical trial data in addition to creating the appropriate data files for further statistical analysis. However, it can also be utilized and expanded to your database system. ACKNOWLEDGMENTS The author would like to take this opportunity to acknowledge all of the assistance and support given to him by Drs. Fouad Kandeel, Jeff Longmate, David Ikle and Joyce Niland at City of Hope National Medical Center. Again, special thanks go to Dr. Longmate and Dr. Kandeel for stimulating his interest and many fruitful discussions in this study project. The author also expresses his appreciation to his colleague, Lorraine Lesiecki, for correcting the errors during the writing of this paper. TRADEMARKS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies. CONTACT INFORMATION Contact the author at: Keh-Dong Shiang, Ph.D. Department of Biostatistics & Department of Diabetes City of Hope National Medical Center 1500 East Duarte Road Duarte, CA 91010-3000 Work Phone: (626)256-4673 Ext. 65768 Fax: (626)471-7106 E-mail: kshiang@coh.org Web: www.coh.org APPENDIX %MACRO VarCheck(VarName, DsName); %LOCAL DsId VIndex Rc; %LET DsId = %SYSFUNC(OPEN(&DsName, IS)); %IF &DsId EQ 0 %THEN %DO; %PUT ERROR: (VarCheck) The data set "&DsName" is not found; %ELSE %DO; %LET VIndex = %SYSFUNC(VARNUM(&DsId, &VarName)); %LET Rc = %SYSFUNC(CLOSE(&DsId)); &VIndex %MEND VarCheck; %MACRO Validation; %LOCAL MasterTotalRow; %LOCAL WhTotalRow; %LOCAL IdTotalRow; %LOCAL DataOK; DELETE * FROM Mylib.Errors; SELECT COUNT(*) INTO :MasterTotalRow FROM Mylib.Master; %DO I = 1 %TO &MasterTotalRow; DATA _NULL_; SET Mylib.Master (FIRSTOBS=&I OBS=&I); CALL SYMPUT('TableID', Memname); CALL SYMPUT('TableName', TblName); CALL SYMPUT('WhTblName', WhTblName); ALTER TABLE Mylib.&TableID ADD Flag INTEGER; PROC COPY IN=Mylib OUT=Work; SELECT &TableID;

SELECT COUNT(*) INTO :WhTotalRow FROM Mylib.&WhTblName; %DO J = 1 %TO &WhTotalRow; DATA _NULL_; SET Mylib.&WhTblName (FIRSTOBS=&J OBS=&J); CALL SYMPUT('Field', FieldName); CALL SYMPUT('WhereCondition', WhereCond); CALL SYMPUT('ErrorMessage', ErrorMsg); %LET wh = WHERE &Field &WhereCondition; SELECT COUNT(*) INTO :IdTotalRow FROM Mylib.&TableID; %DO K = 1 %TO &IdTotalRow; DATA Mylib.Temp; SET Mylib.&TableID; IF _N_ = &K THEN DO; CALL SYMPUT('TrackingNo', TRIM(LEFT(PUT(TrackingNo, 8.)))); CALL SYMPUT('varFlag', TRIM(LEFT(PUT(Flag, 8.)))); OUTPUT; END; SELECT COUNT(*) INTO :DataOK FROM Mylib.Temp &wh; %IF &DataOK > 0 %THEN %DO; INSERT INTO Mylib.Errors (TrackingNo, RowID, TimeStamp, FormName, TableID, FieldName, ErrorMsg) VALUES(&TrackingNo, &RowID, "&TimeStamp", "&TableName", "&TableID", "&Field", "&ErrorMessage"); UPDATE Mylib.Temp SET Flag = -1; %IF &DataOK LE 0 %THEN %DO; UPDATE Mylib.Temp SET Flag = 1; DATA _NULL_; SET Mylib.Temp; CALL SYMPUT('TempFlag', TRIM(LEFT(PUT(Flag, 8.)))); DATA Work.&TableID; SET Work.&TableID; IF (_N_ = &K AND &TempFlag = -1) THEN Flag = -1; DATA Work.&TableID; SET Work.&TableID; IF Flag =. THEN Flag = 1; %MEND Validation; %Validation;