Quality Control of Clinical Data Listings with Proc Compare

Robert Bikwemu, Pharmapace, Inc., San Diego, CA
Nicole Wallstedt, Pharmapace, Inc., San Diego, CA

ABSTRACT

Checking clinical data listings with proc compare is a quick way to validate the order, completeness, and content of the listings. This is especially valuable when a listing runs to hundreds or thousands of pages. Here we introduce the setup steps that both the developer and the quality control (QC) programmer need to take, and suggest steps a QC programmer can follow when validating large listings.

INTRODUCTION

The QC process for listings requires verifying the number of records, assuring the accuracy of the data sort order and content, and checking the formats, column headers, titles, and footnotes. The first two aspects can be accomplished using the steps described in this paper; the third still has to be done by manual review, although reviewing the first page of the output mostly accomplishes it. As a result, using proc compare for the remaining data verification has the potential to dramatically reduce the QC effort for listings. In this paper, we review the roles of the developer and the QC programmer and finish with an example using the SASHELP.CLASS dataset.

LISTING DEVELOPER'S ROLE

To implement these steps, the developer of the listing first needs to define a folder location for the saved dataset using the LIBNAME statement. Second, the developer has to save the dataset that is output in the listing to the new libref. This can be accomplished in two ways. One is to save the dataset with a data step, as seen in Output 1.

    libname savelist "G:/dev/project4/listings/qc";
    data savelist.l_16_1_dm;
      set ae4;
    run;

Output 1. Save Dataset with a Data Step

The other way is to save the dataset while outputting the listing, using proc report's built-in OUT= option. The option can be added to the first line of the proc report; see Output 2 below.
Note that the approach in Output 2 is preferred because the saved dataset is then the most accurate reflection of the listing (additional data manipulation is needed when using this option; see step 3 in the Steps to Combat Common Discrepancies section).

    libname savelist "G:/dev/project4/listings/qc";
    proc report data=sashelp.class out=savelist.l_16_1_dm;
      column sex name age height weight;
    run;

Output 2. Save Dataset with PROC REPORT's Built-in OUT= Option

QC PROGRAMMER'S ROLE

The QC programmer's role is to independently generate a dataset and compare it to the developer's dataset with proc compare. Before running proc compare, the QC programmer should take the steps discussed below to ensure that the datasets can be compared.

STEPS TO COMBAT COMMON DISCREPANCIES

1. Open the listing before creating the QC dataset to get an understanding of the source datasets and the data sort or display order.
2. Check that the date of the developer's dataset is consistent with the version you want to compare, and use the access=readonly option in the LIBNAME statement to prevent overwriting the developer's dataset.
3. Delete all rows with a non-missing _BREAK_ value from the developer's dataset when it was produced by proc report's OUT= option. These rows indicate either summary rows or paging rows that contain retained data.
4. Review the source datasets against the listing and the developer's dataset to identify the variables used for the listing.
5. Use proc compare to check whether the developer used any formats to display data, and double-check that the length of each variable is long enough to fit its contents.
6. Use proc compare's ID statement to list all the variables needed to make each row unique, for example sex, age, and name.
7. Either rename variables so that corresponding columns have the same name in the developer's dataset and the QC programmer's dataset, or use proc compare's WITH statement to identify which variable in the QC programmer's dataset is compared to each variable from the developer's dataset listed in the VAR statement.
8. Ensure you are using the correct population and the correct merge statement.
9. Once the numbers of observations match, run proc compare and use the output as your guide. Find the first discrepant observation and pull up the datasets to compare them side by side.

POTENTIAL BLIND SPOTS WITH PROC COMPARE

1. MISSING VARIABLES: Confirm you are comparing the desired variables.
2. MISSING OBSERVATIONS: Before comparing datasets, confirm that the observation counts of both datasets match.
3. CONFLICTING TYPES: Different types for the same variable name may occur because of re-formatting to adjust order (e.g. SEXN and SEX in the example).
4. MISMATCHED ID VARIABLES: The ID statement in proc compare lists the variable(s) on which to match each observation. If the distributions of these variables differ between the datasets, it can lead to problems.
One solution is to run proc freq on the ID variables, with the LIST option on the TABLES statement, to see whether the counts per stratum match between the datasets.

EXAMPLES: SASHELP.CLASS

The class dataset contains five variables: two character variables, sex and name, and three numeric variables, age, height, and weight. Table 1 is an example output of the dataset with Sex, Age, and Name as the unique ID for the listing.

Sex     Name     Age  Height (cm)  Weight (lb)
Male    Thomas   11   57.5         85
Male    James    12   57.3         83
Male    John     12   59           99.5
Male    Robert   12   64.8         128
Male    Jeffrey  13   62.5         84
Male    Henry    14   63.5         102.5
Male    Alfred   14   69           112.5
Male    William  15   66.5         112
Male    Ronald   15   67           133
Male    Philip   16   72           150
Female  Joyce    11   51.3         50.5
Female  Louise   12   56.3         77
Female  Jane     12   59.8         84.5
Female  Alice    13   56.5         84
Female  Barbara  13   65.3         98
Female  Carol    14   62.8         102.5
Female  Judy     14   64.3         90
Female  Janet    15   62.5         112.5
Female  Mary     15   66.5         112

Table 1. Example Listing of SASHELP.CLASS
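Because the listing treats Sex, Age, and Name as its unique ID, it can help to confirm that this combination really does identify each row before it is used in an ID statement. The following is a minimal sketch of one way to do that; the output dataset names class_ids and class_dups are hypothetical and not part of the original example.

```sas
/* Sketch only: class_ids and class_dups are illustrative names.       */
/* NODUPKEY keeps the first row for each sex/age/name combination;     */
/* DUPOUT= collects any further rows that repeat a combination.        */
proc sort data=sashelp.class out=class_ids nodupkey dupout=class_dups;
  by sex age name;
run;

/* Any rows printed here are duplicate keys. If class_dups is empty,   */
/* sex, age, and name identify each row uniquely.                      */
proc print data=class_dups;
run;
```

The same check can be pointed at the developer's dataset and the QC dataset in turn, since a key that is not unique in either one makes the matching in proc compare unreliable.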
Display 1 shows an example of a dataset made by a developer, and Display 2 shows the proc contents of that dataset. Note that Display 1 contains three more variables than Table 1: SEXN, AGE1, and _BREAK_. We need to assess whether they are used to support the listing output. First, _BREAK_ indicates that the dataset was made by proc report, and all rows in which _BREAK_ is blank need to be checked for accuracy (if _BREAK_ is not blank, the row is likely a paging row or summary row that contains repeated values from the previous row). AGE1 and AGE seem to be identical, but proc contents (Display 2) shows that AGE1 is the character form of AGE, and that AGE is the variable displayed because its label matches the listing header (Table 1). Proc contents also shows that SEXN is the numeric version of the SEX variable seen in Display 1, that SEX is formatted, and that the original form of SEX has a length of 1. This indicates that a format was applied to the SEX variable to output Male and Female.

Display 1. Developer's Dataset

Display 2. PROC CONTENTS of the Developer's Dataset
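The inspection and clean-up described above might be sketched as follows. This is an illustration under assumptions, not the authors' program: the stand-in dataset l_16_1_dm_raw and the BREAK statement are invented here so that the example runs on its own and produces some non-missing _BREAK_ values. In practice the input would be the developer's saved dataset, read through a libref assigned with access=readonly.

```sas
/* Stand-in for the developer's saved dataset (hypothetical name).     */
/* PROC REPORT's OUT= option adds the _BREAK_ variable automatically.  */
proc report data=sashelp.class out=work.l_16_1_dm_raw;
  column sex name age height weight;
  define sex / order;
  break after sex / summarize;  /* assumed break, creates summary rows */
run;

/* Inspect types, lengths, formats, and labels, as in Display 2 */
proc contents data=work.l_16_1_dm_raw;
run;

/* Keep only detail rows: _BREAK_ is blank on rows that come from the  */
/* data and non-blank on rows generated by a break.                    */
data l_16_1_dm;
  set work.l_16_1_dm_raw;
  if missing(_break_);
  drop _break_;
run;
```

After this step, the observation count of l_16_1_dm should equal the number of data rows shown in the listing, which is the count the QC dataset has to match.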
From the above information, we see that we need to use SEXN (the numeric form of SEX), AGE, HEIGHT, and WEIGHT to order the dataset. Again, AGE1 is not needed for sorting and is not in the output, so we don't need to compare this variable. Also, while NAME is part of the unique ID of the table (sex, age, name), it is not used for sorting. One way to check the distribution of the rows is to run a proc freq by the strata you are interested in. In this case, we are interested in the distribution of SEX and AGE (we could equally use SEXN or AGE1, because they match their counterparts). Output 3 shows an example of how to set up proc freq to provide the output in Display 3.

    proc freq data=l_16_1_dm;
      table sex*age / list missing;
    run;

Output 3. PROC FREQ to Check Distribution Between Developer's Dataset and QC's Dataset

Display 3. List View of the PROC FREQ From Output 3

At this point, the counts and the distributions of the datasets match, the sorting is done, and we have the variables we want to compare. Now we can run the proc compare of the developer's dataset (l_16_1_dm) versus the QC programmer's dataset (qc1), as seen in Output 4. Again, Output 4 will work as written if the QC programmer names the variables to match the corresponding variables in the developer's dataset. If not, you will have to add VAR and WITH statements to specify which variables match each other. An example of the result of Output 4 is provided in Displays 4 through 6.

    proc compare base=l_16_1_dm compare=qc1 list;
      id sexn age name;
    run;

Output 4. PROC COMPARE Between Developer's Dataset and QC Programmer's Dataset
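When the QC dataset uses different variable names, the comparison needs VAR and WITH statements. The following is a minimal sketch of that case. The QC variable names ht and wt, the coding of SEXN (1 = Male, 2 = Female), and the stand-in datasets built from SASHELP.CLASS are all assumptions made so that the example runs on its own.

```sas
/* Stand-in for the developer's dataset; SEXN coding is assumed */
data l_16_1_dm;
  set sashelp.class;
  sexn = ifn(sex = 'M', 1, 2);
run;

/* Stand-in for the QC dataset, with hypothetical names ht and wt */
data qc1;
  set l_16_1_dm(rename=(height=ht weight=wt));
run;

/* Sort both datasets by the ID variables before comparing */
proc sort data=l_16_1_dm; by sexn age name; run;
proc sort data=qc1;       by sexn age name; run;

proc compare base=l_16_1_dm compare=qc1 listall;
  id sexn age name;
  var height weight;  /* variables in the developer's (base) dataset   */
  with ht wt;         /* matching variables in the QC (compare) dataset */
run;
```

PROC COMPARE expects both datasets to be sorted by the ID variables unless the NOTSORTED option is given on the ID statement, which is why the sketch sorts them first. The VAR and WITH lists are paired by position, so height is compared to ht and weight to wt.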
Display 4. Page 1 of Result of Output 4

Display 4 shows that both the developer's dataset and the QC programmer's dataset have 19 observations. It also shows that the developer's dataset has one more variable than the QC programmer's. In Display 5 we see that AGE1 is the variable that is missing from the QC programmer's dataset, and in Display 6 we find that all compared variables are exactly equal.

Display 5. Page 2 of Result of Output 4
Display 6. Page 3 of Result of Output 4

CONCLUSION

Proc compare can support the QC process for large listings as well as small ones. All you need to do is follow these steps and be mindful that they fulfill only two parts of the QC process. First, they verify the number of records output from the original dataset. Second, they assure the accuracy of the order and contents of the records. However, these steps do not check the formats, column headers, titles, and footnotes, which can be checked manually by reviewing the first page of the listing.

REFERENCES

SAS Institute Inc. 2017. Base SAS 9.4 Procedures Guide, Seventh Edition. Cary, NC: SAS Institute Inc. https://support.sas.com/documentation/cdl/en/proc/70377/pdf/default/proc.pdf

Horstman, Joshua, Muller, Roger. 2014. Don't Get Blindsided by PROC COMPARE. Proceedings of the 2014 SAS Global Conference. Washington, DC. Paper 1615-2014. http://support.sas.com/resources/papers/proceedings14/1615-2014.pdf

Chen, Honghua. 2012. Prove QC Quality Create SAS Dataset from RTF File. Proceedings of the 2012 SAS NESUG Conference. Baltimore, MD. http://www.lexjansen.com/nesug/nesug12/ph/ph03.pdf

Casas, Angelina. 2003. Proc Compare to Validate Datasets. Proceedings of the 2003 SAS PHARMASUG Conference. Miami, FL. http://www.lexjansen.com/pharmasug/2003/tutorials/tu056.pdf

ACKNOWLEDGMENTS

We would like to thank Jay Zhou, Xiaodong Li, and Dewei Li for reviewing this paper, and Michelle Rossi and Debby Smith for their support as we worked on the early drafts.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Name: Robert Bikwemu
Enterprise: Pharmapace, Inc.
Address: 10509 Vista Sorrento Parkway, Suite 303
City, State ZIP: San Diego, CA 92121
Work Phone: (858)263-0510
E-mail: Robert.Bikwemu@pharmapace.com

Name: Nicole Wallstedt
Enterprise: Pharmapace, Inc.
Address: 10509 Vista Sorrento Parkway, Suite 303
City, State ZIP: San Diego, CA 92121
Work Phone: (858)263-0510
E-mail: Nicole.Wallstedt@pharmapace.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.