THE DATA DETECTIVE
HINTS AND TIPS FOR INDEPENDENT PROGRAMMING QC
DATE: PhUSE 2016
PRESENTED BY: Bethan Thomas

What this presentation will cover
- What is a data detective?
- Writing a validation program
- Maintaining program independence
- Identifying data discrepancies
- Helpful hints and tips

And what this presentation will not cover
- Other considerations
- QC process
- Complete step-by-step guide

What is a Data Detective?
And what is independent double programming?

Being a programmer often feels like you're a detective
- Solving problems
- Identifying root causes
Independent double programming
- Two programmers, one aim
- A method of thoroughly checking outputs and achieving high quality
When those two programmers differ
- You have two suspects!
- A detective is needed to solve the mystery!

Writing an independent program
And maintaining that independence

Make no assumptions
- The man in the balaclava may not have robbed the bank!
- The man in the smart suit may not be innocent!
- Create a safety net but don't duplicate work
- Ensure great programming practice
Use all relevant documentation
- Familiarise yourself with Protocol, CRF, SAP, IGs
- Refer to them regularly and whenever in doubt
Maintain independence
- Be an unbiased detective
- Do not view each other's programs; use %INCLUDE
- Discuss, don't dictate

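One possible way to run the other programmer's code without reading it is sketched below. The file path is hypothetical, and the sketch assumes the primary program can be run as-is from the QC session. NOSOURCE2 (the usual default) keeps the included statements out of the log.

```sas
/* Hypothetical path: replace with the location of the primary program. */
filename primary "/study/primary/adpe.sas";

/* Do not echo statements from included files to the SAS log. */
options nosource2;

/* Run the primary program without opening it. */
%include primary;
```

The QC programmer then only sees the datasets the primary program creates, not how it created them.
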
Detecting Data Discrepancies 1
Differing order variables

- Matching numbers of observations
- No obvious pattern of mismatching observations
- Mismatching on most variables
- Both programmers to check key variables

Detecting Data Discrepancies 2
Differing order variables or order variables differing?

- Possibly due to differing order variables
  - E.g. one is using AVISITN, the other ADT or VISITNUM
  - E.g. one is using PARAM, the other is using PARAMCD
- Possibly differing values of order variables
  - E.g. VISITNUM numbered differently for unscheduled visits
- Pattern or pairing in mismatching rows

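A minimal sketch of ruling out sort order as the cause: both datasets are copied to temporary datasets, sorted by one agreed key, and compared using that key in the ID statement. The toy data and the choice of USUBJID, PARAMCD and AVISITN as the key are assumptions for illustration.

```sas
/* Toy data: the same two records, ordered differently by each programmer. */
data main;
  input usubjid $ paramcd $ avisitn aval;
  datalines;
S01 DIABP 1 80
S01 SYSBP 1 120
;
run;

data qc;
  input usubjid $ paramcd $ avisitn aval;
  datalines;
S01 SYSBP 1 120
S01 DIABP 1 80
;
run;

/* Sort temporary copies by the same key; the originals are left untouched. */
proc sort data=main out=main_s;
  by usubjid paramcd avisitn;
run;

proc sort data=qc out=qc_s;
  by usubjid paramcd avisitn;
run;

/* Match records on the key rather than on row position. */
proc compare base=main_s compare=qc_s listall;
  id usubjid paramcd avisitn;
run;
```

If the mismatches disappear after re-sorting, the two programs differ only in their order variables.
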
Detecting Data Discrepancies 3
Using source data and documentation 1

The aim of independent double programming is not simply for the data to match but for it to be correct. Data should be an accurate reflection of the source and conform to the necessary formats.

Example 1: AVISIT mapping of unscheduled visits when the ADPE specification states, "Populate for scheduled assessments."
- The datasets are identical except that QC has populated AVISIT with "Visit 3", whereas the primary dataset has AVISIT set to null in the equivalent records.
- Reference the schedule of assessments: Visit 3 is scheduled, but a Physical Examination is not planned at this visit.
- The validation programmer populated AVISIT in all cases unless the value of VISITNUM indicated an unscheduled visit (e.g. VISITNUM=4.01). The primary programmer only populated it where a Physical Exam was specifically scheduled.
- Check the SAP to see if it provides more detail on how it classifies unscheduled visits and how they should be handled for analysis.

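One way to see how each programmer mapped AVISIT is to cross-tabulate it against VISITNUM in both datasets. The sketch below uses invented toy records that mimic the example; the dataset and subject values are assumptions.

```sas
/* Toy records, invented for illustration only. */
data main;
  infile datalines dsd truncover;
  length usubjid $8 avisit $10;
  input usubjid $ visitnum avisit $;
  datalines;
S01,2,Visit 2
S01,3,
S01,4.01,
;
run;

data qc;
  infile datalines dsd truncover;
  length usubjid $8 avisit $10;
  input usubjid $ visitnum avisit $;
  datalines;
S01,2,Visit 2
S01,3,Visit 3
S01,4.01,
;
run;

/* MISSING keeps null AVISIT values in the table so they can be compared. */
proc freq data=main;
  tables visitnum*avisit / list missing;
run;

proc freq data=qc;
  tables visitnum*avisit / list missing;
run;
```

Placing the two listings side by side shows which VISITNUM values are mapped differently, which can then be checked against the schedule of assessments and the SAP.
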
Detecting Data Discrepancies 3
Using source data and documentation 2

An example from SDTM. The snapshots below come from Main and QC datasets for a Biospecimen Events (BE) domain. Gene Expression on 8th January is in the main dataset but is not present in QC.
- Refer to raw data
- Refer to CRF

Inside the detective's toolkit
The FREQ procedure

When observation counts differ, it can be difficult to know where to start looking. Calculate frequencies by one or more variables and use those variables in the ID statement of PROC COMPARE.

Good choices
- Test/parameter
- Visit/timepoint
- Subject
- DTYPE/PARAMTYP
- Grouping qualifiers
Poor choices
- Sequence number
- Date/day
- Flags
- Free text
- Results

Add further BY variables or subset the data to narrow down to an issue that can be investigated. This is also useful for checking mappings of coded variables.

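The approach above can be sketched as follows, assuming toy datasets in which QC is missing one diastolic record. PARAMCD is used here as the frequency variable; any of the "good choices" could be substituted.

```sas
/* Toy data: QC is missing the S02 DIABP record. */
data main;
  input usubjid $ paramcd $;
  datalines;
S01 SYSBP
S01 DIABP
S02 SYSBP
S02 DIABP
;
run;

data qc;
  input usubjid $ paramcd $;
  datalines;
S01 SYSBP
S01 DIABP
S02 SYSBP
;
run;

/* Count records per parameter in temporary datasets. */
proc freq data=main noprint;
  tables paramcd / out=main_n (drop=percent);
run;

proc freq data=qc noprint;
  tables paramcd / out=qc_n (drop=percent);
run;

/* Compare the counts, matching on the frequency variable. */
proc compare base=main_n compare=qc_n listall;
  id paramcd;
run;
```

Here COUNT differs only for PARAMCD=DIABP (2 in main, 1 in QC), so the next step would be to repeat the counts by subject within that parameter.
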
A QC Program and a program to QC
Keep it separate

There are lots of programmatic ways of identifying discrepancies and their causes:
- Subset (variables or records)
- Re-sort
- Calculate frequencies
- Modify data
Keep these in a separate program. Use temporary datasets; do not overwrite data.

The LISTALL option
And a warning about ID variables

The LISTALL option lists observations or variables present in one dataset but not the other, as well as comparing observations present in both datasets.

Coupled with ID variables, this is particularly helpful. E.g. when comparing counts by PARAMCD, the output might state that PARAMCD=SYSBP is only found in the Main dataset.

If the LISTALL option is used without ID variables, the output would simply state that the last observation is found in the Main dataset only, and would not point to the specific parameter.

If ID is used without LISTALL, the following misleading output can appear:

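A minimal sketch of LISTALL combined with an ID variable, using invented count datasets in which SYSBP exists only in the main data:

```sas
/* Toy count datasets, already sorted by PARAMCD. */
data main_n;
  input paramcd $ count;
  datalines;
DIABP 10
SYSBP 10
;
run;

data qc_n;
  input paramcd $ count;
  datalines;
DIABP 10
;
run;

/* LISTALL reports the ID value of any observation found in only one dataset. */
proc compare base=main_n compare=qc_n listall;
  id paramcd;
run;
```

Removing the ID statement or the LISTALL option from this call and re-running it is a quick way to see how much less informative the report becomes.
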
SAS Shortcut Keys

As the same techniques can be used for QC-ing any kind of dataset, you can save time by creating a SAS shortcut or abbreviation. This is very straightforward, but the steps differ depending on the version of SAS; check SAS help for details.
- In SAS Enterprise Guide, go to the Program menu and into Editor Macros.
- In older versions of SAS, this can be found in the Tools menu, under Keyboard Macros.

/* Template: working copies of the main and QC datasets */
data main;
  set adam.;
  subject = scan(usubjid,2,'-') || '-' || scan(usubjid,3,'-');
  * if;
  * where;
  * keep;
  * drop;
run;

data qc;
  set qadam.;
  subject = scan(usubjid,2,'-') || '-' || scan(usubjid,3,'-');
  * if;
  * where;
  * keep;
  * drop;
run;

/* Optional steps: uncomment to re-sort or to compare counts */
/*proc sort data=main;*/
/* by ;*/
/*run;*/
/*proc freq data=main noprint;*/
/* table subject / out=main (drop=percent);*/
/*run;*/

/*proc sort data=qc;*/
/* by ;*/
/*run;*/
/*proc freq data=qc noprint;*/
/* table subject / out=qc (drop=percent);*/
/*run;*/

proc compare base=main compare=qc listall;
/* id subject ;*/
run;

Some final hints and tips

Remember to visually check for obvious anomalies and to avoid rare cases where both programmers make identical mistakes. Check for:
- Truncation
- Missing data
- Implausible values
- Incorrect mapping from source

Use the relevant validation checkers.

Compile a QC checklist covering tasks and checks required for each output type (SDTM, ADaM, TFL) to ensure thoroughness and consistency. Add study-specific checks to the list if necessary.

Continually refer to documentation (Protocol, CRF, SAP, shells, CDISC documentation).
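
Some of the checks listed above can be partly automated. The sketch below is one possible approach on an invented toy dataset: it lists variable lengths to help spot truncation, then summarises a numeric result to reveal missing and implausible values. The dataset name, variables and values are assumptions for illustration.

```sas
/* Toy QC dataset: AVALC is deliberately too short, AVAL has a missing */
/* value and an implausible value.                                     */
data qc;
  length usubjid $8 avalc $5;
  input usubjid $ avalc $ aval;
  datalines;
S01 NORMAL 120
S02 ABNORMAL .
S03 NORMAL 1200
;
run;

/* Variable lengths: compare against the specification to spot truncation. */
proc sql;
  select name, type, length
    from dictionary.columns
    where libname = 'WORK' and memname = 'QC';
quit;

/* Missing counts and range for the numeric result. */
proc means data=qc n nmiss min max;
  var aval;
run;

/* Distinct character values: truncated text is usually easy to see here. */
proc freq data=qc;
  tables avalc / missing;
run;
```

These summaries support the visual check; they do not replace it.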