Using SAS software to fulfil an FDA request for database documentation

Pantaleo Nacci, Adam Crisp
Glaxo Wellcome R&D, UK

Introduction

Historically, a regulatory submission seeking approval for a new drug (or for a new indication of an already marketed one) has consisted of a huge amount of paper. Lately, various regulatory authorities, and particularly the Food and Drug Administration (FDA), have begun to allow the presentation of Case Report Forms (CRF) and Case Report Tabulations (CRT) in electronic format, and to informally request electronic copies of all data used in the analysis. This trend is reflected in two new Guidances for Industry, both still in draft, titled Electronic Submission of Case Report Forms and Case Report Tabulations and Submitting Application Archival Copies in Electronic Format. These documents can be found on the WWW at http://www.fda.gov/cder, following the link to the Regulatory Guidance section.

The Problem

During the pre-NDA phase for a new drug, one of the reviewers asked for a copy of all the SAS data sets used during the analysis, so that he could run his own analyses and so ease and speed up the review. In our personal experience this was the first time we had received a request of this nature. The core of the NDA would encompass seven large well-controlled clinical trials, and we had to supply data for all of them. The first discussion (very short indeed) was about what to put into these data sets, but that was quite clear from the initial request: the data sets sent had to be as complete as possible, e.g. containing both raw and LOCF'ed (last observation carried forward) efficacy data.
Then we came to more practical problems. First of all, we had to deliver the SAS data sets, complete with all formats permanently linked, without knowing which platform would be used to access them. Then we had to prepare a two-part document for each protocol, describing in the first section the generic contents of each data set, and containing in the second one ...the record and field description (including data type and locations) as well as a narrative definition of each field for each data set. In practical terms, we needed a document conveying in a structured manner all the information the reviewer might need, so as to avoid any unnecessary delays in the review process.

To recap, for each protocol we had to provide the reviewers with basically three things:

- the SAS data sets
- a catalog containing all the formats used by these data sets
- a document explaining what information was contained in each data set and how it was stored

The Initial Approach

Putting temporarily aside the problem of how to deliver data sets and formats, we concentrated on the documentation. Approached in the usual way, i.e. getting someone to type everything using a word processor, the task was enormous and left ample room for typing errors; moreover, the quantity of information we could convey was necessarily limited by time constraints.

The type of information to be presented in the documentation was quite clear, so we quickly drafted a two-part document, structuring it loosely along the lines of a PROC CONTENTS output. The first section would simply contain the names of all the data sets and their descriptions, while the second had to give as much information as possible about all variables in the data sets. The draft versions of the sections looked like this:

First section

Protocol XXXXXXXX
Index of data sets

Data Set Name    Description
ADVERSE          Adverse Events
DEMOG            Demographic Details
DIARY            Diary cards data
...              ...

Second section

Protocol no. XXXXXXXX
Description of variables in the ADVERSE data set

Variable Name    Description                    Format
SUBJECT          Subject number
PTID             Protocol code                  $7.
SEX              Sex                            $VSEX.
AEVTX            Adverse event verbatim text    $66.
...              ...                            ...

This draft, however, had several severe shortcomings: it was very cumbersome to prepare, and it showed nothing about any external formats attached to the variables apart from their names, so the reviewer would have had to switch to another document to see the actual codes.

The Breakthrough

Work had already begun, because timings were extremely tight, but even the idea of having to check the final documents for mistakes was not pleasant. We definitely had to find another way, both faster and more flexible, to prepare the document (or, better, have it prepared) for each protocol.

We started from the initial idea of creating a document resembling the PROC CONTENTS output. The first part of the document could be created using data set labels, really a piece of cake, while the second was at that moment nothing more than a stripped-down and rearranged version of the procedure's output. And since we were going to fiddle around anyway, it would be really nice to have, next to each variable, all the codes used plus the corresponding text instead of the format's name. To increase the educational value of the exercise, we also decided to try PROC SQL, which we were told was also much faster.

We then redrafted the second section of the document in line with these ideas, and agreed to deliver for each data set a page (or a set of pages) structured like this:

Protocol no. XXXXXXXX
Description of variables in the ADVERSE data set

Variable    Full Description               Abbreviations
SUBJECT     Subject number                 n/a
PTID        Protocol code                  n/a
SEX         Sex                            F=Female
                                           M=Male
AEVTX       Adverse event verbatim text    n/a
...         ...                            ...

Having decided the format of the document, we turned to the other problems. We quickly agreed that the best way to deliver the data sets was to use the XPORT engine, but for the formats catalog we were willing to try something else. What about recreating the original SAS code used to create the catalog? That way, we would be shipping to the reviewer the three pieces he needed as a SAS transport file, a SAS program and a text document. This left very little room for trouble, because all he needed to do was define some libraries, run the SAS program to recreate the formats catalog, and then use PROC COPY to obtain the data sets with all the necessary formats already attached.

A Quick Tour

The final result of these efforts is two programs, called FINDFMT and INDEX; both are listed at the end of this paper. A brief explanation of the various steps in each program follows.

FINDFMT.SAS

- The first lines are generic, and set up the necessary environment variables according to the platform the program is running on (in our case, VMS or Windows 3.1).
- Using PROC SQL, a data set is created containing the names of all non-SAS-provided formats used by the data sets in a library. This data set is then merged with another one containing the names of all formats available in all libraries. If a format is present more than once, the one defined in the local library is selected (e.g. one of the central formats had some spelling errors, so it had to be recreated locally).
- Two lists of formats to be SELECTed, one from each catalog, are created using the macro language. These formats are then output as data sets using PROC FORMAT's CNTLOUT= option. Finally, a DATA _NULL_ step is used to create the SAS program, ready to be included and run later.
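
The last FINDFMT step, turning a CNTLOUT= data set back into PROC FORMAT source code, can be illustrated with a minimal sketch. This is not the FINDFMT program itself: the format $VSEX is taken from the example table, while the data set name CTL and the fileref GENCODE are hypothetical. It assumes character formats with single discrete values only (no ranges, no OTHER, no embedded quotes) and a SAS release that supports the TEMP filename device, which is newer than the SAS 6.08 used for the original programs.

```sas
/* Illustration only: CTL and GENCODE are hypothetical names. */
proc format;                        /* a small user-defined format in WORK.FORMATS */
  value $vsex 'F' = 'Female'
              'M' = 'Male';
run;

proc format cntlout = ctl;          /* dump the definition to a control data set */
  select $vsex;
run;

proc sort data = ctl;
  by fmtname;
run;

filename gencode temp;              /* temporary file receiving the generated code */

data _null_;
  set ctl end = eof;
  where type = 'C';                 /* sketch handles character formats only */
  by fmtname;
  file gencode;
  if _n_ = 1 then put 'proc format;';
  if first.fmtname then put '  value $' fmtname;
  /* +(-1) steps back over the blank that list output writes after a value */
  put @5 "'" start +(-1) "' = '" label +(-1) "'";
  if last.fmtname then put '  ;';
  if eof then put 'run;';
run;

%include gencode / source2;         /* run the generated PROC FORMAT step */
```

The generated file holds an ordinary PROC FORMAT step, so the recipient needs no catalog at all, only the program text. Numeric formats, ranges and exclusion flags need the extra handling shown in the full listing.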

INDEX.SAS

- Again, the first lines are generic, and set up the necessary environment variables according to the platform the program is running on.
- Using PROC SQL, a data set is created containing the names of all the data sets in a designated library (STATDATA in our case).
- Using the data sets' labels, the first part of the document is created. At the same time, a list of these data sets is compiled and their number is saved in a macro variable (MEMNUM).
- From within a macro (DATLOG), another macro (CHECK) is called MEMNUM times (i.e. once for each data set), passing the names picked from the list as parameters. This is where the second part of the document is built up, one data set at a time.
- Using PROC SQL again, the program gets each variable's name, label and the name of the attached format. A list of the variables having external formats attached is then built, and their number is saved in a macro variable (NUM).
- If NUM is greater than zero (i.e. at least one variable has an external format attached), all observed values for each variable in the list are collected in a data set using PROC FREQ's OUT= option. These values are then used to obtain all the abbreviations by PUTting them with the corresponding format, and the resulting information is merged with the initial data set.
- A DATA _NULL_ step is used to append the page(s) for the current data set to the document.
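
The value-decoding step described above can be sketched in isolation. The following is a minimal, self-contained illustration and not the INDEX program: the data set DEMO, its contents and the output data set names CODES and ABBREV are hypothetical, and the format $VSEX is borrowed from the example table.

```sas
/* Illustration only: DEMO, CODES and ABBREV are hypothetical names. */
proc format;
  value $vsex 'F' = 'Female'
              'M' = 'Male';
run;

data demo;
  input subject sex $;
  format sex $vsex.;
  datalines;
1 F
2 M
3 F
;
run;

/* Collect the distinct stored codes. A FORMAT statement naming no format
   removes the attached one, so PROC FREQ groups on the raw values. */
proc freq data = demo noprint;
  format sex;
  tables sex / out = codes (keep = sex);
run;

/* Decode each raw value with the format to build the "code=text" entry */
data abbrev;
  set codes;
  length descript $ 50;
  descript = trim(sex) || '=' || put(sex, $vsex.);
run;

proc print data = abbrev noobs;
run;
```

With this input DESCRIPT should hold F=Female and M=Male, one row per observed code. Only values that actually occur in the data are listed, which is also how the approach described above behaves: a code defined in the format but never observed will not appear in the document.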

At the very end, the program creates a transport file containing all the data sets, using the XPORT engine; the CC=NONE option is necessary to eliminate some compatibility problems.

Conclusions

Overall, the time needed to write and optimise these two programs amounted to about two working days. Since then they have been used on a number of protocols in different therapeutic areas, usually requiring only minor adjustments to execute properly. The only requirement for these programs is that the data sets must be complete, meaning that all labels have to be in place and all formats properly attached; in a well-organised working environment this should be the normal way of documenting data sets anyway. The only SAS module needed is SAS/BASE.

The amount of time saved by using these programs, which would otherwise have been spent on mundane activities, has been considerable, but even more important is the quality of the final result, virtually free of any unnecessary human error. Office automation systems, and personal computers in particular, exist precisely to free us from daunting and repetitive tasks, and re-entering existing information surely qualifies as one.

For further information, please contact:

Mr Pantaleo Nacci
Glaxo Wellcome R&D
MDS European Clinical Statistics
Greenford Road
Greenford
Middlesex UB6 0HE
United Kingdom
e-mail address: pn3755@glaxowellcome.co.uk

SAS and SAS/BASE are registered trademarks of SAS Institute Inc., Cary, NC, USA.

The FINDFMT program

%let ptid = XXXXXXXX;
%let mdp = %substr(&ptid, 1, 3);

%macro get_os;
  /* Declared global so the values are still available after the macro ends */
  %global rootdir middle end goldfmts;
  %if "&sysscp" = "VMS" %then %do;
    %let rootdir = %str(mds_&mdp:[&ptid);
    %let middle = %str(.);
    %let end = %str(]);
    %let goldfmts = %str(gold$pdata:[sas_dict]);
  %end;
  %else %if "&sysscp" = "WIN" %then %do;
    %let rootdir = %str(c:\&mdp\&ptid);
    %let middle = %str(\);
    %let end = %str(\);
    %let goldfmts = %str(l:\gold\sas_dict);
  %end;
%mend get_os;
%get_os;

/*--------------------------------------------------------------------
  Send log and output to files
--------------------------------------------------------------------*/
proc printto log = "&rootdir.&middle.saslog&end.findfmt.log"
             print = "&rootdir.&middle.sasout&end.findfmt.lis" new;
run;

/*-------------------------------------------------------------------
  Program Name       : FINDFMT.SAS
  Program Version    : Version 1
  Program Purpose    : To create a SAS PROC FORMAT step which contains
                       all the formats used for a protocol, to use
                       with FDA data sets
  SAS Version        : SAS 6.08 (VMS)
  Program Created By : P Nacci
  Date               : 09 October 1996
-------------------------------------------------------------------*/

%put %quote( ) &sysdate &systime; /* log time and date */

options linesize = 132 pagesize = 60 pageno = 1 nonumber nodate notes
        source nosource2 mprint fmtsearch = (work library sasfmt);
title;
footnote;

libname sasfmt "&goldfmts";
libname library "&rootdir.&middle.sasfmts&end";
libname sasview "&rootdir.&middle.sasview&end";
libname statdata "&rootdir.&middle.statdata&end";
filename formats "&rootdir.&end.formats.sas";

* Get the names of all non-sas-provided formats used in the datasets ;
proc sql;
  create table temp1 as
    select distinct format
    from dictionary.columns
    where libname = 'STATDATA' & memtype = 'DATA' &
          compress(format, '$0123456789.')
            not in (' ', 'DATE', 'TIME', 'DATETIME', 'BEST', 'Z')
    order by format;
quit;

* Now those of all formats contained in all defined catalogs ;
data temp2 (keep = libname format);
  set sashelp.vcatalg;
  where memname = 'FORMATS' & libname in ('LIBRARY', 'SASFMT');
  length format $ 9;
  format = compress(objname) || '.';
  if objtype = 'FORMATC' then format = '$' || format;
run;

proc sort data = temp2;
  by format libname;
run;

* Merge the two sets to check from which catalog each format must be taken
  If multiple formats have the same name, give precedence to local one ;
data temp3;
  merge temp1 (in = in1) temp2;
  by format;
  if in1;
  format = compress(format, '.');
run;

* LIBRARY sorts before SASFMT, so NODUPKEY keeps the local copy ;
proc sort data = temp3 out = temp4 nodupkey;
  by format;
run;

* Store the format names in macro variables LOCL1... and GOLD1... ;
data _null_;
  set temp4 end = eof;
  retain num1-num2 0;
  select (libname);
    when ('LIBRARY') do;
      num1 + 1;
      maclocl = 'locl' || compress(put(num1, 3.));
      call symput(maclocl, format);
    end;
    when ('SASFMT') do;
      num2 + 1;
      macgold = 'gold' || compress(put(num2, 3.));
      call symput(macgold, format);
    end;
    otherwise;
  end;
  if eof then do;
    call symput('loclnum', compress(put(num1, 3.)));
    call symput('goldnum', compress(put(num2, 3.)));
  end;
run;

proc datasets lib = work mt = data;
  delete trans1 trans2;
quit;

%macro franz;
  /* Local formats */
  %if &loclnum ^= 0 %then %do;
    proc format lib = library cntlout = trans1;
      select
      %do i = 1 %to &loclnum;
        &&locl&i
      %end;
      ;
    run;
  %end;
  /* Central formats */
  %if &goldnum ^= 0 %then %do;
    proc format lib = sasfmt cntlout = trans2;
      select
      %do i = 1 %to &goldnum;
        &&gold&i
      %end;
      ;
    run;
  %end;
%mend franz;
%franz

* Note: this step expects both TRANS1 and TRANS2 to exist, i.e. at least
  one format taken from each catalog ;
data trans;
  set trans1 trans2;
run;

proc sort data = trans out = temp1;
  by fmtname type;
run;

* Write the PROC FORMAT source code to the FORMATS file ;
data _null_;
  set temp1 end = eof;
  by fmtname type;
  file formats;
  select (type);
    when ('C') do;            /* character format: $ prefix, quoted values */
      name = '$' || fmtname;
      sepa = "'";
    end;
    when ('N') do;            /* numeric format: unquoted values */
      name = fmtname;
      sepa = '';
    end;
  end;
  if sexcl = 'Y' then lower = '<'; else lower = '';
  if eexcl = 'Y' then upper = '<'; else upper = '';
  texta = sepa || compress(start, ' *') || sepa;
  textb = sepa || compress(start, ' *') || sepa || lower || '-' || upper ||
          sepa || compress(end) || sepa;
  if _N_ = 1 then put 'proc format lib = library;';
  if first.type then put ' value ' name;
  if start = end then put @4 texta '= "' label '"';
  else put @4 textb '= "' label '"';
  if last.type then put ';';
  if eof then put '';
run;

/*--------------------------------------------------------------------
  Revert log to LOG window, and output to OUTPUT window
--------------------------------------------------------------------*/
proc printto;
run;

The INDEX program

%let ptid = XXXXXXX;
%let mdp = %substr(&ptid, 1, 3);

%macro get_os;
  /* Declared global so the values are still available after the macro ends */
  %global rootdir middle end goldfmts;
  %if "&sysscp" = "VMS" %then %do;
    %let rootdir = %str(mds_&mdp:[&ptid);
    %let middle = %str(.);
    %let end = %str(]);
    %let goldfmts = %str(gold$pdata:[sas_dict]);
  %end;
  %else %if "&sysscp" = "WIN" %then %do;
    %let rootdir = %str(m:\&mdp\&ptid);
    %let middle = %str(\);
    %let end = %str(\);
    %let goldfmts = %str(l:\gold\sas_dict);
  %end;
%mend get_os;
%get_os;

/*--------------------------------------------------------------------
  Send log and output to files
--------------------------------------------------------------------*/
proc printto log = "&rootdir.&middle.saslog&end.index.log"
             print = "&rootdir.&middle.sasout&end.index.lis" new;
run;

/*-------------------------------------------------------------------
  Program Name       : INDEX.SAS
  Program Version    : Version 1
  Program Purpose    : To create a document describing the contents
                       of all data sets contained in a library,
                       according to FDA request
  SAS Version        : SAS 6.08 (VMS)
  Program Created By : A Crisp
  Date               : 02 October 1996
  Modified By        : P Nacci
  Date               : 22 October 1996
-------------------------------------------------------------------*/

%put %quote( ) &sysdate &systime; /* log time and date */

options linesize = 132 pagesize = 60 pageno = 1 nonumber nodate notes
        source nosource2 mprint fmtsearch = (work library sasfmt);
title;
footnote;

libname sasfmt "&goldfmts";
libname library "&rootdir.&middle.sasfmts&end";
libname sasview "&rootdir.&middle.sasview&end";
libname statdata "&rootdir.&middle.statdata&end";
filename index "&rootdir.&end.index.p07";

* Dimensions of output file;
%let cols = 106;
%let rows = 77;

%macro check (data);
  /* Name, label and attached format of every variable in the data set */
  proc sql;
    create table conts as
      select name, label, format
      from dictionary.columns
      where libname = 'STATDATA' & memname = "&data" & memtype = 'DATA'
      order by name;
  quit;

  proc print data = conts;
    title "Contents of &data";
  run;

  /* Store the variables with an external format as VAR1..., FMT1... */
  %let num = 0;
  data _null_;
    set conts end = eof;
    where compress(format, '.$0123456789')
            not in (' ', 'DATE', 'TIME', 'DATETIME', 'BEST', 'Z', 'CHAR');
    macv = 'var' || compress(put(_n_, 3.));
    macf = 'fmt' || compress(put(_n_, 3.));
    call symput(macv, name);
    call symput(macf, format);
    if eof then call symput('num', put(_n_, 3.));
  run; /* the step must execute before &num is tested below */

  %if &num = 0 %then %goto the_end;

  proc datasets lib = work mt = data;
    delete log;
  quit;

  %do i = 1 %to &num;
    %let var = &&var&i;
    %let fmt = &&fmt&i;

    /* Observed raw values of the variable (attached format removed) */
    proc freq data = statdata.&data noprint;
      format &var;
      tables &var / out = value (keep = &var rename = (&var = value));
    run;

    /* Build the "code=text" abbreviation for each observed value */
    data build (drop = value rename = (value_ = value));
      set value;
      length name value_ $ 8 descript $ 50;
      name = "&var";
      value_ = compress(value);
      if value_ not in (' ','.') then do;
        descript = compress(value_) || '=' || put(value, &fmt);
        output;
      end;
    run;

    proc append base = log data = build;
    run;
  %end;

  proc sort data = log;
    by name;
  run;

  data conts;
    merge conts log;
    by name;
  run;

  %the_end:
  /* Append the page(s) for this data set to the document */
  data _null_;
    length descript $ 50; /* keeps DESCRIPT character when no formats were found */
    set conts end = eof;
    by name;
    retain col1-col3 (2 12 54) ctd 0;
    file index print notitle ls = &cols ps = &rows header = newpage
         linesleft = ll mod;
    if &num = 0 | descript = ' ' then descript = 'n/a';
    if ll < 4 & not eof then do;
      ctd = 1;
      put &cols.*'_' _page_;
    end;
    if first.name | ctd then do;
      put @col1 name @col2 label @col3 descript;
      ctd = 0;
    end;
    else put @col3 descript;
    if eof then put &cols.*'_';
    return;
    newpage:
      put @col1 "Protocol &ptid" //
          @col1 "Description of variables in the %upcase(&data) dataset" @;
      if ctd then put ' (Continued)' @;
      put &cols.*'_' //
          @col1 'Variable' @col2 'Full Description' @col3 'Abbreviations' /
          &cols.*'_' /;
    return;
  run;
%mend check;

%macro datlog;
  %do j = 1 %to &memnum;
    %let dat = %scan(&list, &j);
    %check(&dat)
  %end;
%mend datlog;

* Names and labels of all data sets in the library ;
proc sql;
  create table temp as
    select distinct memname, memlabel
    from dictionary.tables
    where libname = 'STATDATA' & memtype = 'DATA'
    order by memname;
quit;

* First part of the document, plus the list and count of data sets ;
data _null_;
  set temp end = eof;
  length tmp $ 200;
  retain col1-col2 (2 25) tmp;
  file index print notitle ls = &cols ps = &rows header = newpage new;
  put @col1 memname @col2 memlabel;
  tmp = trim(tmp) || ' ' || compress(memname);
  if eof then do;
    call symput('list', tmp);
    call symput('memnum', put(_n_, 3.));
    put &cols.*'_';
  end;
  return;
  newpage:
    put @col1 "Protocol &ptid" // "Index of data sets" / &cols.*'_' //
        @col1 'Data Set Name' @col2 'Description' / &cols.*'_' /;
  return;
run;

%datlog;

* Transport file containing all the data sets ;
libname xpt xport "&rootdir.&end.&ptid..xpt" cc = none;
proc copy in = statdata out = xpt mt = data;
run;

/*--------------------------------------------------------------------
  Revert log to LOG window, and output to OUTPUT window
--------------------------------------------------------------------*/
proc printto;
run;