PharmaSUG Paper AD09


PharmaSUG 2015 - Paper AD09

The Dependency Mapper: How to Save Time on Changes Post Database Lock

Apoorva Joshi, Biogen, Cambridge, MA
Shailendra Phadke, Eliassen Group, Wakefield, MA

ABSTRACT

We lay great emphasis on building a perfect roadmap to database lock: we keep all stakeholders in sync, nail down the specifications, and simulate dry runs before the lock, all so that the final deliverables, the tables, listings, figures, and data sets, can be produced seamlessly once the database locks. In our race to create these deliverables in record time, however, we usually forget to account for one thing: human error. Changes to any part of the deliverable post database lock can be frustrating and can delay timelines, whether the change is small, such as correcting a label typo in ADLB, or more severe, such as a new derivation in ADSL. By creating a dependency map, an extensive map of data sources and the programs that depend on them, one can greatly reduce the time spent on changes post database lock. This paper explores the advantages of creating a dependency map and different methods of creating such maps programmatically, with sample use cases that explain the time-saving advantages in detail. While complete code is not provided in this paper, the important snippets needed to get started are included.

INTRODUCTION

This paper provides a simple approach to the often complicated process of identifying and managing changes post database lock. Before jumping in, readers should be familiar with database lock procedures and SAS programming, and should understand SDTM and ADaM data sets and how tables, listings, and figures reference these data sources. Figure 1 gives a high-level overview of the basic concept of this paper.

Figure 1. Basic Concept

THE BASIC CONCEPT

Imagine a 100% validated deliverable comprising four ADaM data sets and one hundred tables, listings, and figures (TLFs), as shown in Figure 2.

Figure 2. Dependencies within the Deliverable

Each data set references ADSL, and the various TLFs reference the various ADaM data sets, creating a web of dependencies. We asked a random sample of programmers and statisticians how much time it would take to rerun and QC the above deliverable, assuming no data changes. The answers ranged from 2 hours to 2 days, depending on the QC techniques involved.

NOTHING IS CONSTANT BUT CHANGE

Once the database locks, we rerun and QC our deliverable and everything looks great. Production and QC programmers high-five each other and are ready to go home by 5 in the evening. Welcome to our imaginary world. Raise your hand if you have ever been in the following situation: the database has locked, the data sets have passed QC, and the TLFs have been run, QC-passed, and sent over for senior review. Then you receive an email saying a variable derivation needs to be changed in ADLB because of a data issue that was not identified before database lock. While reprogramming the derivation might not be time consuming, the ordeal that follows is every programmer's nightmare: once the data sets have been updated (and usually one would rerun all data sets to maintain similar timestamps), all the TLFs need to be updated and passed through QC. We asked the same statisticians and programmers to estimate the time needed to modify, rerun, and QC the deliverable after the change to ADLB. The range of 2 hours to 2 days had grown to 2 days to a week, and most of those interviewed believed the entire deliverable should be rerun and every TLF QC-passed. This got us thinking: any change post database lock is beyond our control.
What can be controlled is which parts of the deliverable must be passed through QC and which can simply be rerun, without worrying that the change to ADLB affects them. If only 15 of the 100 TLFs reference the ADLB data set, why should we worry about checking the remaining 85?

DEPENDENCY MAPPING

A dependency mapper identifies the dependencies within a deliverable, namely between the various data sets and the output TLFs. With a comprehensive map of the dependency between every TLF and data set, we could limit our QC activities to the programs that reference ADLB. Many work under the assumption that a change in one data set requires QC of the entire deliverable, but if one can prove that the change affects only a subset of the deliverable, QC can be confined to that subset, cutting down the turnaround time. We explored several methods of dependency mapping:

1. Build a parser that scans the log files for any library name (LIBNAME) references and identifies the data sets referenced through those libraries.

2. A slight variation of the above: create a macro that is used in the programs to read any data set, then build a parser that scans the logs for the keywords the macro writes each time it is invoked. The text that follows these keywords is the data set name, which the parser reads in to create the dependency map.

3. Manually scan through each program and create a list of referenced data sets.

4. Ask programmers to keep track of the data sets they reference as they program an output, and consolidate the lists of programs and referenced data sets.

Each of the options above has its pros and cons. The third is extremely labor intensive, prone to human error, and not worth considering. The fourth, while quick and requiring the least application development, is also highly prone to human error and could become a maintenance nightmare. The second is close to fool-proof but can still be defeated by human error (such as referencing a data set directly, without the macro call). The first is fool-proof but requires handling a large number of permutations and combinations during application development. While a programmer might choose any of these techniques, we decided to focus our efforts on the first option.

IMPLEMENTATION

Discussed below is the implementation of the first method: a parser that scans the log files for LIBNAME references and identifies the data sets referenced through those libraries. We developed a macro that takes the following parameters:

a. Path of the deliverable location to check for data dependencies. The log files of the programs must be present in this location.

b. Path of the folder where the output Excel file displaying all data dependencies will be saved.

c. List of the libraries from which data is read. This parameter is optional, as the program automatically reads all library references.

d. List of the outputs (data sets and TLFs) to check for data dependencies. This is also optional but comes in handy when one wants to check only a specific subset of the deliverable.

ACCESS THE INPUT FOLDER AND CREATE A LIST OF SAS FILES

Using the following code, we access the input folder defined in the macro call and read in the log file names.
DATA dsn(keep=nfile nnext);
   /* inpath is the macro parameter holding the path of the deliverable.
      Assign a fileref for the directory given by &inpath */
   rc = filename('dir', "&inpath");
   dirid = dopen('dir');
   /* numsel holds the number of members in the directory */
   numsel = dnum(dirid);
   DO i = 1 TO numsel;
      /* nlst holds each member name, including the extension */
      nlst  = dread(dirid, i);
      nfile = scan(nlst, 1, '.');
      nnext = upcase(scan(nlst, 2, '.'));
      /* output only the members that are log files */
      IF nnext = 'LOG' THEN OUTPUT;
   END;
   /* close the directory after all members have been read */
   rc = dclose(dirid);
RUN;

As an example, if the deliverable folder contains a few SDTM, ADaM, and table program log files, the dsn data set will look as follows:

nfile         nnext
DM            LOG
AE            LOG
ADSL          LOG
ADAE          LOG
T-DM-DEMOG    LOG
T-AE-SOC      LOG
T-AE-SOC-PT   LOG

Display 1: Contents of the DSN Data Set

Using PROC SQL, save the nfile values into a macro variable, separated by blanks; each value is then scanned out and used in an INFILE statement to read the corresponding log file.

PROC SQL noprint;
   SELECT nfile INTO :prgname SEPARATED BY ' ' FROM dsn;
   SELECT count(DISTINCT nfile) INTO :cnt FROM dsn;
QUIT;

The prgname macro variable will contain the value shown below in Output 1.

DM AE ADSL ADAE T-DM-DEMOG T-AE-SOC T-AE-SOC-PT

Output 1: Contents of the prgname Macro Variable

The cnt macro variable will have the value 7. The following code selects each log file in turn and reads it into a SAS data set with an INFILE statement.

%DO i = 1 %TO &cnt;
   %LET log = %SCAN(&prgname, &i, ' ');
   %LET logfile = &inpath./&log..log;

   DATA logscan_&i.;
      INFILE "&logfile." DSD;
      LENGTH item $500;
      INPUT item;
   RUN;
%END;

SCAN THE FILES TO CHECK FOR DEPENDENCIES

Before scanning a log file we need to know the different library references used for the input data sets. Since each programmer may define his or her own library references, we first scan for the note generated by the LIBNAME statement. When a library reference is defined (for example, libname LIBDEMO "/server/drug/study/deliverable"), the following note appears in the log:

NOTE: Libref LIBDEMO was successfully assigned as follows:

Output 2: LIBNAME statement resolved in the log

Using a combination of the INDEX and SCAN functions, we can capture each library reference:

IF INDEX(item, "NOTE: Libref") THEN DO;
   ref = SCAN(item, 3);
   OUTPUT;
END;

Once the library references are known, use the INDEX function to find each log line that contains a library reference, and capture the substring consisting of the library reference followed by the data set name. This is a data set that is being used and should be stored as a reference. For example, the log file might contain the following note:

NOTE: There were 160 observations read from the data set LIBDEMO.ADSL.

Output 3: Data set reference resolved in the log

IF INDEX(item, STRIP(ref) || '.') THEN DO;
   dataused = SCAN(SUBSTR(item, INDEX(UPCASE(item), STRIP(ref) || '.')), 1, ' ');
   OUTPUT;
END;

The statement above first checks whether the string LIBDEMO. is present in the log line and then substrings LIBDEMO.ADSL into the dataused variable; LIBDEMO.ADSL is the data set being referenced. Once each log file has been scanned for all the library references, the referenced data sets are output to a SAS data set. The output data set can be classified into analysis data sets, tables, listings, and graphs for better visual representation.
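The scanning logic above is not tied to SAS. As a minimal illustration of the same idea, the following Python sketch runs the two passes, first finding the assigned librefs and then the libref-qualified data set references, over two hypothetical log fragments, and then derives which programs would need re-QC after a change to one data set. The program names, log text, and regular expressions here are assumptions for illustration only, not the authors' implementation.

```python
import re

def scan_log(log_text):
    """Collect the librefs assigned in a log, then every LIBREF.DATASET
    reference that is qualified by one of those librefs."""
    librefs = {m.upper() for m in
               re.findall(r"NOTE: Libref (\w+) was successfully assigned", log_text)}
    refs = set()
    for lib, dsn in re.findall(r"\b(\w+)\.(\w+)\b", log_text):
        if lib.upper() in librefs:
            refs.add(f"{lib.upper()}.{dsn.upper()}")
    return refs

# Hypothetical log fragments, shaped like the notes quoted above.
logs = {
    "T-AE-SOC": (
        "NOTE: Libref LIBDEMO was successfully assigned as follows:\n"
        "NOTE: There were 160 observations read from the data set LIBDEMO.ADAE.\n"
        "NOTE: There were 160 observations read from the data set LIBDEMO.ADSL.\n"
    ),
    "T-DEMOG": (
        "NOTE: Libref LIBDEMO was successfully assigned as follows:\n"
        "NOTE: There were 200 observations read from the data set LIBDEMO.ADSL.\n"
    ),
}

# Program -> data sets it reads: the dependency map.
depmap = {prog: scan_log(text) for prog, text in logs.items()}

# Impact analysis: after a change to ADAE, only programs that read it need full QC.
affected = sorted(p for p, refs in depmap.items() if "LIBDEMO.ADAE" in refs)
```

From depmap it is a short step to an Excel-style summary, for example one row per program listing the data sets it references.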

DISPLAY OF FINAL RESULTS

The final results are displayed in an Excel file, as shown in the sample output in Display 2.

Date executed   Output        SDTM/ADaM/other data referenced
Nov 4 14:07     T-AE-SOC-PT   LIBDEMO.ADAE, LIBDEMO.ADSL
Nov 4 14:07     T-AE-SOC      LIBDEMO.ADAE, LIBDEMO.ADSL
Nov 4 14:49     T-DEMOG       LIBDEMO.ADSL
Nov 4 14:07     ADSL          LIBDEMO.DM
Nov 6 11:01     ADAE          LIBDEMO.ADSL, LIBDEMO.AE

Display 2: Sample Output Identifying Dependencies for the Deliverable

The date and time when each output was created can also be displayed to check whether the order of execution was correct. In the example above, ADAE was executed after the tables that reference it, which would violate standard QA practices.

CONCLUSION

A dependency mapper provides a distinct time-saving advantage when changes must be rerun and QC'd post database lock. It also enables additional checks, such as ensuring all data is mapped, verifying the chronological order of data and TLF reruns, and providing supporting documentation for the program table of contents.

ACKNOWLEDGMENTS

The authors would like to thank Daniel Boisvert and Matthew Carlson from Biogen for their time and input in brainstorming the idea of a dependency mapper and for providing the support required to implement this concept.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Apoorva Joshi
Sr. Analyst III, Statistical Data Analytics
Biogen
300 Binney St, Suite 94W26
Cambridge, MA 02142
Work Phone: 617.914.6723
E-mail: apoorva.joshi@biogen.com

Shailendra Phadke
Statistical Programming Contractor
Eliassen Group
30 Audubon Rd, Wakefield, MA 01880
Work Phone: 617.914.4132
E-mail: sphadke@eliassen.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.