Tracking Dataset Dependencies in Clinical Trials Reporting

Binoy Varghese, Cybrid Inc., Wormleysburg, PA
Satyanarayana Mogallapu, IT America Inc., Edison, NJ

ABSTRACT

Most clinical trial study reporting involves the creation of analysis datasets from raw data. Analysis datasets function as an intermediate point where complex computations are performed on raw data and stored for later use in tables, listings and graphs. Depending on the complexity of the study and the endpoints being analyzed, these datasets may have to be created not only from raw data but also from other analysis datasets. If a study has a large number of analysis datasets, manually documenting dependencies is not only time consuming but is also an area where human errors may lead to severe consequences. The purpose of this paper is to present a technique that automatically stores metadata about datasets, including their dependencies, which can then be used for quality control checks. Some such checks are: the output dataset has a time stamp later than the input datasets used to create it, datasets do not have circular dependencies, etc.

INTRODUCTION

Most organizations use a hierarchical folder structure to organize datasets, tables, listings, graphs, programs and associated validation work. Some systems have a built-in framework that automatically tracks analysis-to-raw/analysis data dependencies. If such a framework is not available, programmers can keep track of dependencies by documenting them manually. The dependency list then determines the order in which analysis dataset programs are run. Maintaining a dependency list is not only a laborious process that involves constant coordination between team members but also poses the risk of inadvertent human errors. If a considerable number of analysis datasets are being created (e.g., Phase II/Phase III trials), the risk and labor are compounded.
Automatic tracking of dataset dependencies can be enabled by making modifications to the programming infrastructure and introducing %read and %write macro calls in the analysis dataset programs. The infrastructure changes and macros are discussed in detail in the subsequent sections. The complete SAS code for the macros (%read and %write) is provided in the paper and can be used as is in most cases.

FOLDER STRUCTURE AND METADATA INFORMATION

Fig. 1 Typical folder structure

Fig. 2 Modified folder structure
Fig. 1 is a snapshot of a typical folder structure. Fig. 2 shows the relative location where the metadata folder is created. This folder will contain the metadata information generated by the analysis dataset programs using the macros %read and %write. Fig. 3 is a snapshot of the contents of the metadata folder. The dataset meta_a_dem.sas7bdat is automatically created by the analysis dataset program a_dem.sas. The convention followed in naming the metadata dataset is to prefix the program name with meta_.

Fig. 3 Sample contents of metadata folder

Fig. 4 Data structure and sample information contained in meta_a_dem.sas7bdat

Fig. 4 shows the data structure and typical contents of the meta dataset. The attrib variable identifies the input and output datasets and the program that created this information. The datetime variable holds the creation datetime of the input and output datasets and the time at which the program was submitted to the SAS engine for execution. AN and EXTRACT are librefs defined in the autoexec.sas file used by the analysis dataset programs.

AUTOEXEC FILE

The autoexec.sas file used by the analysis dataset programs has to be modified to include a library reference meta pointing to the metadata folder. The libref meta is used by the %read and %write macros.

   libname meta "<path to metadata folder>";
The sasautos option has to be revised to point to the macro folder, if it is not already pointing to this location. The macro folder will contain read.sas and write.sas.

   options sasautos=("<path to macro folder>") mautosource;

%READ MACRO

The %read macro is used to read datasets from the analysis and extract data libraries into the work data library. The %read macro has 3 parameters:

LIB  source library name (required parameter)
DSN  input dataset name (required parameter)
OUT  output dataset name (optional parameter). If this parameter is not specified in the macro call, the dataset copied to the work library will have the same name as the input dataset.

The algorithm used in the %read macro is divided into 2 parts:

INITIALIZATION

This part of the macro is executed only once during a SAS session. The tasks performed are:

1. Obtain <program name>.
2. Check if the metadata dataset exists. If so, delete the existing dataset.
3. Create the dataset structure and output <program name>.
4. Save the meta_<program name> dataset in the metadata folder.

REPETITION

This part of the macro is executed at each macro invocation during a SAS session. The tasks performed are:

5. Read the dataset from the source data library and store it in the work data library.
6. Obtain the last-modified datetime information and append it to the meta_<program name> dataset in the metadata folder.

The complete SAS code for the %read macro is listed below. Copy the code as is and save it as read.sas in the macro folder.
%macro read(lib=,dsn=,out=);

%if %symexist(firstcall) eq 0 %then %do;
   %global progname;

   proc sql noprint;
      select distinct scan(scan(trim(left(xpath)), -1, "\"), 1, '.') into :progname
      from sashelp.vextfl
      where index(upcase(xpath), '.SAS');   /* xpath is upcased, so compare to '.SAS' */
   quit;

   proc datasets library=meta nolist;
      delete meta_&progname;
   quit;

   data meta.meta_&progname;
      length metadata attrib $100;
      format datetime datetime.;
      datetime=input("&sysdate:&systime", datetime.);
      metadata=upcase("&progname");
      attrib='program NAME';
      output;
      label metadata='meta data'
            attrib='attribute'
            datetime='date & Time';
   run;
%end;

%if %symexist(firstcall) eq 0 %then %do;
   %global firstcall;
   %let firstcall=1;
%end;

%if &out= %then %let out=&dsn;

data work.&out;
   set &lib..&dsn;
run;

ods listing close;
ods output Attributes=meta._temp_attrib_&lib._&dsn;
proc contents data=&lib..&dsn;
run;
ods output close;
ods listing;

data meta._temp_attrib_&lib._&dsn;
   length metadata attrib $100;
   format datetime datetime.;
   set meta._temp_attrib_&lib._&dsn;
   where compress(upcase(label1))='LASTMODIFIED';   /* label1 is upcased, so compare in upper case */
   metadata=upcase("&lib..&dsn");
   attrib="input DATA";
   datetime=nvalue1;
   keep metadata attrib datetime;
run;

proc datasets library=meta nolist;
   append base=meta.meta_&progname data=meta._temp_attrib_&lib._&dsn;
   delete _temp_attrib_&lib._&dsn;
quit;

proc sql undo_policy=none;
   create table meta.meta_&progname as
   select distinct * from meta.meta_&progname;
quit;

%mend read;

%WRITE MACRO

The %write macro call is made after the analysis dataset has been created in the work data library. The %write macro stores the analysis dataset in the analysis data library; with regard to Fig. 1, this location is mystudy\data\analysis. The library reference an pointing to this location must be defined in the autoexec.sas file. The %write macro has only 1 required parameter, the name of the analysis dataset to be stored. The %write macro does not have an initialization stage, although it uses the macro variable progname created by the %read macro. This is based on the assumption that the %write macro call will be made only after the %read macro has been invoked at least once. The tasks performed by the %write macro are:

1. Read the dataset from the work data library and store it in the analysis data library.
2. Obtain the last-modified datetime information and append it to the meta_<program name> dataset in the metadata folder.

The complete SAS code for the %write macro is listed below. Copy the code as is and save it as write.sas in the macro folder.

%macro write(dsn=);

proc datasets nolist;
   copy in=work out=an;
   select &dsn;
quit;

ods listing close;
ods output Attributes=meta._temp_attrib_an_&dsn;
proc contents data=an.&dsn;
run;
ods output close;
ods listing;

data meta._temp_attrib_an_&dsn;
   length metadata attrib $100;
   format datetime datetime.;
   set meta._temp_attrib_an_&dsn;
   where compress(upcase(label1))='LASTMODIFIED';   /* label1 is upcased, so compare in upper case */
   metadata=upcase("an.&dsn");
   attrib="output DATA";
   datetime=nvalue1;
   keep metadata attrib datetime;
run;

proc datasets library=meta nolist;
   append base=meta.meta_&progname data=meta._temp_attrib_an_&dsn;
   delete _temp_attrib_an_&dsn;
quit;

proc sql undo_policy=none;
   create table meta.meta_&progname as
   select distinct * from meta.meta_&progname;
quit;

%mend write;

SAMPLE ANALYSIS DATASET PROGRAM

Fig. 5 shows the %read and %write macro calls in a sample analysis dataset program.

Fig. 5 Sample analysis dataset program

   --SAS PROCESSING
   %read(lib=an,dsn=dem);
   --SAS PROCESSING
   %read(lib=extract,dsn=ae);
   --SAS PROCESSING
   %write(dsn=ae);
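Once a program has run, the metadata it leaves behind can drive the quality control checks mentioned in the abstract, such as verifying that the output dataset is newer than every input used to create it. The following is a minimal sketch of such a check, assuming the meta dataset structure shown in Fig. 4; the macro name %check_stale and its parameter pgm are illustrative and are not part of %read or %write:

```sas
%macro check_stale(pgm=);
   /* count output rows whose datetime precedes the newest input datetime */
   proc sql noprint;
      select count(*) into :stale
      from meta.meta_&pgm
      where attrib='output DATA'
        and datetime < (select max(datetime)
                        from meta.meta_&pgm
                        where attrib='input DATA');
   quit;
   %if &stale > 0 %then
      %put WARNING: output of &pgm..sas is older than one of its inputs - rerun the program.;
   %else
      %put NOTE: timestamps recorded for &pgm..sas are consistent.;
%mend check_stale;

%check_stale(pgm=a_dem);
```

Because each program writes its own meta dataset, a check like this can be looped over every member of the meta library without any concurrent-update concerns.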
USING THE METADATA INFORMATION

SAMPLE APPLICATIONS

#1 GENERATING DATA DEPENDENCY LISTING

Macro %gen_dep_list uses the metadata information to create a dependency list of all analysis datasets. Fig. 6 shows the proc print output generated by the macro.

Fig. 6 Proc Print output generated by %gen_dep_list

The complete SAS code for %gen_dep_list is listed below.

libname meta "<path of metadata folder>";

%macro gen_dep_list;

ods listing close;
ods output members=memlist(keep=name);
proc datasets library=meta memtype=data;
quit;
ods output close;
ods listing;

data _null_;
   set memlist end=last;
   call symput('m'||compress(put(_n_,best.)), compress(name));
   if last then call symput('n', compress(put(_n_,best.)));
run;

%do dcnt=1 %to &n;

   data &&m&dcnt;
      set meta.&&m&dcnt;
      if attrib=:'input' then ordr=1;
      if attrib=:'output' then ordr=2;
      if attrib=:'program' then ordr=0;
   run;

   proc sort data=&&m&dcnt;
      by ordr metadata;
   run;

   data &&m&dcnt(keep=pgmname metadata inputdsn rename=(metadata=outputdsn));
      retain inputdsn pgmname;
      length inputdsn $200 pgmname $20;
      set &&m&dcnt;
      by ordr metadata;
      if ordr=0 then pgmname=compress(metadata);
      if ordr=1 and first.ordr then inputdsn=trim(inputdsn)||compress(metadata);
      else if ordr=1 and not first.ordr then inputdsn=trim(inputdsn)||', '||compress(metadata);
      if ordr=2 then output;
   run;

%end;
data metadata;
   set
   %do dcnt=1 %to &n;
      &&m&dcnt
   %end;
   ;
run;

proc sort data=metadata;
   by pgmname;
run;

proc print data=metadata;
   var pgmname outputdsn inputdsn;
run;

%mend gen_dep_list;

%gen_dep_list;

#2 EXECUTION ORDER FOR ANALYSIS DATASET PROGRAMS

Macro %exe_order uses the metadata information to obtain the execution order for the analysis dataset programs. Fig. 7 shows the proc print output generated by the macro.

Fig. 7 Proc Print output generated by %exe_order

   PROGRAM          ANALYSIS DATASET     PRIORITY
   a_dem.sas        dem.sas7bdat         1
   a_ae.sas         ae.sas7bdat          2
   a_subjchar.sas   subjchar.sas7bdat    3

Fig. 8 Relationship between programs, analysis datasets and priority values

The algorithm used in this macro is briefly described below:

1. Create a dataset which contains the metadata information for all analysis dataset programs.
2. If an analysis dataset program does not use any other analysis dataset, keep only one observation with inputdsn missing; otherwise keep all observations where inputdsn begins with AN.
3. For analysis datasets that have inputdsn missing, assign priority=1; otherwise assign priority=0.
4. For any analysis dataset with priority=0, iterate through the dataset to check whether all of its input analysis datasets have been assigned a priority value greater than 0. If they have, assign priority = maximum of the input datasets' priority values + 1; otherwise reset the priority value to 0.
5. Once all analysis datasets have a non-zero priority, the execution order of each program is determined by the maximum priority value assigned to its analysis datasets.
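Note that the iteration in step 4 assumes the dependency graph is acyclic: if two analysis datasets depend on each other, neither can ever receive a non-zero priority and the loop would never terminate. A sketch of a simple guard is shown below; it assumes, as in %exe_order, that &n holds the number of metadata datasets, which bounds the number of passes because each successful pass must resolve at least one dataset. The comment placeholders stand for the existing priority-assignment steps:

```sas
%let iter=0;
%let misspr=1;
%do %while (&misspr > 0);
   %let iter=%eval(&iter+1);
   %if &iter > &n %then %do;
      /* more passes than datasets: at least one circular dependency exists */
      %put ERROR: Circular dependency detected - execution order cannot be resolved.;
      %let misspr=0;   /* force the loop to exit */
   %end;
   %else %do;
      /* priority-assignment passes exactly as in the %exe_order code */
   %end;
%end;
```

This turns a silent infinite loop into the circular-dependency check mentioned in the abstract.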
The complete SAS code for %exe_order is listed below.

%macro exe_order;

ods listing close;
ods output members=memlist(keep=name);
proc datasets library=meta memtype=data;
quit;
ods output close;
ods listing;

data _null_;
   set memlist end=last;
   call symput('m'||compress(put(_n_,best.)), compress(name));
   if last then call symput('n', compress(put(_n_,best.)));
run;

%do dcnt=1 %to &n;

   proc sql undo_policy=none;
      create table &&m&dcnt as
      select distinct progname, outputdsn,
             case when cnt=0 then '' else inputdsn end as inputdsn
      from (select *, sum(index(inputdsn,'AN.')) as cnt   /* metadata values are upcased */
            from (select a.metadata as inputdsn,
                         b.metadata as outputdsn,
                         c.metadata as progname
                  from (select * from meta.&&m&dcnt where attrib='input DATA') a,
                       (select * from meta.&&m&dcnt where attrib='output DATA') b,
                       (select * from meta.&&m&dcnt where attrib='program NAME') c))
      where cnt=0 or index(inputdsn,'AN.') > 0;
   quit;

%end;

data metadata;
   set
   %do dcnt=1 %to &n;
      &&m&dcnt
   %end;
   ;
   if inputdsn='' then priority=1;
   else priority=0;
run;

%let misspr=1;

%do %while (&misspr > 0);

   proc sql undo_policy=none;
      create table int1 as
      select * from metadata
      where inputdsn in (select outputdsn from metadata where priority ne 0)
      order by progname, outputdsn, inputdsn;

      create table int2 as
      select distinct a.progname, a.inputdsn, a.outputdsn,
             sum(b.priority,1) as priority
      from int1 a, metadata b
      where a.inputdsn=b.outputdsn
      order by progname, outputdsn, inputdsn;
   quit;

   proc sort data=metadata;
      by progname outputdsn inputdsn;
   run;

   data metadata(keep=progname outputdsn inputdsn priority);
      merge metadata(in=a) int2(in=b rename=(priority=newpr));
      by progname outputdsn inputdsn;
      if a;
      if b then priority=newpr;
   run;

   proc sql undo_policy=none;
      create table metadata as
      select inputdsn, outputdsn, progname,
             case when min(priority)=0 then 0 else max(priority) end as priority
      from metadata
      group by progname, outputdsn;
   quit;

   proc sql;
      select count(*) into :misspr
      from metadata
      where priority = 0;
   quit;

%end;

proc sql;
   create table pgmordr as
   select distinct progname length=20 format=$20.,
          max(priority) as priority
   from metadata
   group by progname
   order by priority, progname;
quit;

proc print data=pgmordr;
   var progname priority;
run;

%mend exe_order;

%exe_order;

CONCLUSIONS

The technique discussed in this paper makes 3 assumptions:

1. Each analysis dataset program within a study reads and writes datasets only through the %read and %write macro calls.
2. The metadata folder contains only datasets created by the %read and %write macro calls.
3. The datasets contained within the metadata folder are not renamed, modified or deleted.

If these assumptions are not complied with, the metadata information will be incomplete and unusable. The %read and %write macro calls create a metadata dataset for each program rather than a single dataset per study encompassing information for all analysis datasets. This is because concurrent update of a single SAS dataset would require SAS/SHARE, whereas this technique avoids the need for concurrent updates.
Although this technique will accommodate programs that generate multiple analysis datasets, it is advisable to use one program to generate only one analysis dataset. This technique has been successfully tested in batch mode using SAS version 9.2 on the Windows operating system.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Binoy Varghese
Cybrid Inc., Wormleysburg, PA
mailme@binoyvarghese.com
www.clinicalsasprogramming.com

Satyanarayana Mogallapu
IT America Inc., Edison, NJ
mogallapuvs@yahoo.com