Standardizing Data Processing and e-publishing for the Pharmaceutical Industry


Shawn Wang, MedXview, Inc., Cambridge, MA

ABSTRACT

As timing and efficiency become ever more important for a company to secure a winning position in new drug development, and with the FDA requiring electronic submissions after 2002, how are we going to get the job done quickly and accurately? Standardization is an effective generic solution applied in business re-engineering. Applying standards at every step yields large time-compression and more efficient use of available resources. This paper briefly discusses standardizing data processing and e-publishing in clinical trials and electronic submission, providing real-life examples to demonstrate the advantages of standardized processing over other approaches.

1 INTRODUCTION

1.1 Needs for Standardization

Standardization can save the pharmaceutical industry up to 80% of data processing and e-publishing time. Without standardization, the likelihood of duplicating work is always present. Standardization, however, provides the opportunity to leverage resources globally by re-using database designs and program structure designs from one trial to another.

Since 1999, the FDA has been preparing standards for different types of electronic submissions. Beginning in 2002, the FDA requires paperless submissions to improve efficiency, so pharmaceutical companies need to bring their current systems up to date to meet the new requirements. The following business imperatives have changed the way we do our work:

- The FDA has been continually refining the standards for data content, structure, and format.
- FDA reviewers have been raising the bar, demanding that companies submit the SAS programs used for item 10 so that they can reproduce exactly the same output as the submission.
- The volume of work is rapidly increasing because multiple new drugs are being developed simultaneously.
- Computation techniques are changing dramatically.
To do our work faster and more effectively, better business practices are needed. Standardization is a generic solution applied in business re-engineering. Applying standardization in each drug development phase yields large time-compression that results in more effective use of available resources.

1.2 General Ideas for Standardization

Managing, analyzing and reporting individual clinical trials can be completed independently. However, integrating all of the data and the results becomes a very challenging task when the databases and SAS programs of the individual clinical trials were designed and constructed without globalization and standardization in mind. It is therefore necessary to define and construct similar databases and SAS programming environments for each trial, in terms of structure and content, based on similar protocols and CRFs.

Most companies are increasingly aligning their globalization and standardization activities with the International Conference on Harmonisation (ICH), which supports Software Development Life Cycle (SDLC) principles. The SDLC has been defined by NIST (the National Institute of Standards and Technology) as: "The period of time beginning when a software product is conceived and ending when the product is no longer available for use." The software development life cycle is typically broken into phases denoting activities such as requirements, design, programming, testing, installation, and operation and maintenance. The SDLC is certainly the key to building standardized report systems.

1.3 Phases of Standardization

1.3.1 Requirements

- Review the company's guidelines, such as SOPs, the SAP, the annotated CRF, and FDA guidance.
- Discuss inconsistencies in information with other team members.
- Resolve any data discrepancies and finalize the requirements.

1.3.2 Design

- Identify standard and non-standard tables, listings, and figures.
- Use existing SAS programming tools and the specifications from the statistician to process standard tables, listings and figures, and annotated mock-ups of non-standard tables, listings and figures.
- Propose quality control test plans.
- Discuss and revise the draft tables, listings and figures with other team members.

1.3.3 Development

- Convert programming specifications into SAS code.
- Produce quality control test scripts.
- Conduct testing and debugging.
- Present software toolkits or programs in review meetings with the other team members.
- Provide draft tables, listings and figures to the other team members for review and approval.

1.3.4 Maintenance

- Collect feedback from FDA reviewers and other team members.
- Determine which software toolkits or SAS programs for non-standard TLFs should be promoted to project-level standard TLFs.
- Update validation documentation.

To summarize: three principles (standards, modeling, regulatory compliance) and one tip (a modular approach).

2. OVERVIEW OF DATA PROCESSING AND E-PUBLISHING

2.1 Key Point: Two-dimensional Rectangular Form

All data sets, tables and listings are presented in a two-dimensional rectangular form of rows and columns. Furthermore, whether we use a standardized system or write programs without standardization, we actually deal with two different sections in each table: (1) the marginal section, which includes titles, headers, footers and labels on the side; and (2) the interior section, the pure data values from the SAS data set.

All data, tables, listings and figures can be standardized by setting up standards and models. Under specific standards and models, the data structure and TLFs can be well established by sets of SAS macros or SCL methods. Several standards and models are used as the basis for development: Guidance for Industry: Providing Regulatory Submissions to the Center for Biologics Evaluation and Research (CBER) in Electronic Format; Providing Regulatory Submissions in Electronic Format - NDAs; and the CDISC Submission Metadata Model, Version 2.0, are all good references. FDA guidance is more a general consideration than an actual model. CDISC standards and models are based on the 80% rule (at most 80% coverage is guaranteed to match) and have some deviations. There is no unique, perfect-fit solution for all pharmaceutical companies. Each company needs to build the standards and models that bridge the gap between FDA requirements and its specific organizational preferences and practices. A tailor-made standard report system does not just suddenly appear; it needs to be developed.

2.2 Data Standardization

To paraphrase Tolstoy in Anna Karenina: all standardization approaches are alike, but every non-standardization adventure is labor-intensive in its own way. There are many ideas in circulation for building e-submission standards, models, and systems, but it is important to think about what is basic. What we are dealing with is nothing but a two-dimensional rectangular form, as section 2.1 states. The actions we can take on a two-dimensional rectangular data structure are very limited. For example, we can sort by row (observation), by column (variable), or both; that is all we can do for sorting. For the sorting action, we write standard reusable macros.
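As a minimal sketch of such a reusable sorting macro (the macro, library, and variable names here are hypothetical illustrations, not the paper's actual toolkit):

```sas
/* Hypothetical reusable sorting macro: sorts any data set by a
   caller-supplied key list; PROC SORT records the resulting sort
   order in the output data set's SORTEDBY information. */
%macro std_sort(indata=, outdata=, keys=);
  proc sort data=&indata out=&outdata;
    by &keys;
  run;
%mend std_sort;

/* Usage: sort the raw demographics data by subject and visit. */
%std_sort(indata=raw.demog, outdata=crt.demog0, keys=subjid visitid);
```

The same macro then serves every trial, which is the point of standardization: the key list changes, the program structure does not.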
There are eight important actions we may need to perform on raw data in order to create CRT data (FDA guidance item 11):

1) SORTEDBY, which sorts by key variables that can be used for define.pdf generation; the sorted data values (horizontal in the table) and data vectors (vertical in the table) are then convenient for statistical analysis.
2) KEEP, which transfers variables from the raw (CRF) data into the CRT data sets without any changes.
3) DROP, which removes from the raw (CRF) data only those variables kept for data management purposes.
4) NAME CHANGE, which gives new names to raw (CRF) variables that do not meet CRT data requirements, such as variable names over 8 characters, duplicated names, or names with non-explicit meaning.
5) CHARACTER TO NUMERIC CHANGE, which converts time/date characters to numeric values for statistical calculations.
6) NUMERIC TO CHARACTER CHANGE, which converts numeric raw (CRF) entries to letter codes (1 to "Yes" and 2 to "No").
7) VALUE CHANGE, which converts improper raw (CRF) values, such as negative height or weight, or improperly defined values, such as the missing value -99, to appropriate CRT statistical values and definitions.
8) LABEL CHANGE, which renames non-explicit or improperly made labels, such as those over 40 characters, to conform to FDA requirements.

These actions can obviously be performed automatically by well-defined SAS macros or SCL methods, and can easily be implemented by setting up the standards and models. All the user requirements can be put in an Excel file; SAS can read the Excel file and create the individual SAS program for a specific domain very easily. A new trick from SAS/ODS exchanges information between SAS and Excel CSV files without opening the Excel file. We have successfully developed all of the required generic macros and SAS/SCL interfaces and toolkits to help automate these procedures, so the above eight actions can be achieved accurately and efficiently right now.
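One way to drive such an action from a spec file is sketched below, assuming a hypothetical CSV spec (spec.csv, columns RAWNAME and CRTNAME) that lists the name changes for one domain:

```sas
/* Hypothetical spec-driven NAME CHANGE: each row of spec.csv maps a
   raw (CRF) variable name to its CRT name; CALL EXECUTE assembles
   and runs one PROC DATASETS RENAME statement from the spec. */
data _null_;
  infile 'spec.csv' dsd firstobs=2 truncover end=eof;
  input rawname :$32. crtname :$32.;
  if _n_ = 1 then
    call execute('proc datasets lib=crt nolist; modify demog0; rename');
  call execute(' ' || trim(rawname) || '=' || trim(crtname));
  if eof then call execute('; quit;');
run;
```

The other seven actions can be driven the same way: the spec supplies the variable-level metadata, and a generic macro turns it into code, so no domain program is hand-written twice.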
The following is an example of standards and models used to create CRT data sets:

2.2.1 Standards

The following standards are implemented as part of the pharmaceutical industry's standards to incorporate FDA guidelines into a company's development process:

- Define the data set name: a CRT data set name must be <= 8 characters. The name should use 6 or 7 characters so that the last one or two characters can be 0 or _0 (such as demog0). Make sure there are no duplicate data set names, since all the data sets will be placed in the same folder at submission.
- Define key variables: the key variables uniquely identify each record in the data set.
- Define common variables: the variables that are added to each CRT data set (such as subjid, trtgrp, sex, age).
- Modify the remaining variables: apply the eight actions to each variable to clean up any uncertainties and conflicts in the raw data sets and reduce ambiguity for submission reviewers.
- Sort by key, common, and remaining variables: the data set can be sorted by the key variables and common variables, then by the order collected on the CRF. The SORTEDBY key variables will be reflected in the data definition table. All verbatim variables and their original variables should stay together.
- Avoid software issues: use SAS Version 6 or higher to create Version 5 XPT files, and make sure there are no version conflicts, such as in the length of labels, variable names, and formats.
- Avoid attribute issues: variable names should be unique and <= 7 characters, so that the last character can be used to identify special meanings for variables in CRT data sets (such as VISITID, VISITIDZ); variable names then never exceed 8 characters. Version 5 requires labels of <= 40 characters. Every formatted numeric variable gets a new twin character variable created by decoding its format.
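A simple automated check of these attribute standards can be written against the SAS dictionary tables; a sketch follows (the CRT library name is an assumption):

```sas
/* Hypothetical compliance check: list CRT variables whose names
   exceed 8 characters or whose labels exceed 40 characters, both
   of which violate the Version 5 transport-file limits above. */
proc sql;
  select memname, name, label
    from dictionary.columns
    where libname = 'CRT'
      and (length(name) > 8 or length(label) > 40);
quit;
```

An empty result set means the library conforms; running such a check in the standard process catches attribute issues before the XPT files are cut.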
2.2.2 Model for CRT Data Sets

The model contains the following components in each domain: variable name, variable label, key order, variable type, variable length, formats, and comments. The data definition table (define.pdf) is the reflection of the data model defined by the company's own practice. SAS can create hyperlinks and bookmarks in PDF. Several vendors create define.pdf using VB, C++, and other languages. What they do is use SAS as the backbone software to manipulate SAS data sets and create SAS output; they then build a VB or VC interface to transfer the SAS output into a Microsoft Word file and later convert it into a PDF file. This process is too laborious. Since several software packages are involved, manually entered data is sometimes needed, such as hand-typing page numbers and other information. If variables appear on multiple CRF pages, or if formats have been created in multiple SAS catalogs, there is going to be a problem. Customizing a third-party tool can be very difficult, since all the output files come from SAS. It makes sense to use SAS to create all e-publishing tools directly. We have successfully used SAS to create SAS transport files and all our PDF reports, such as define.pdf, and to combine all the tables, listings, and figures together with bookmarks and hyperlinks.

2.2.3 Summary

The first component of data processing contains all data collected in a clinical trial, whether paper CRFs, ancillary data on tape, or other paper- or computer-based data. These are called raw data. The second component is the CRT data. This data is organized in a normalized structure best suited for FDA guidance item 11. Data standardization applies the eight actions to transfer raw (CRF) data into standard (CRT) data. The third component, the analysis data, is intended for two usages: (1) it constructs row group variables, column group variables and outcome variables; and (2) it adds calculated values, derived variables, and analysis decisions onto (1), creating a data set that includes the statistical summary for final presentation in tables, listings and figures. Standardizing this type of data is associated with TLF standardization, which follows.

2.3 TLF Standardization

2.3.1 Four Phases in TLF Generation

1. Obtain the clinical trial raw data from the CRF and QC it.
2. Create CRT data sets from the CRF data using FDA guidance, for instance the DEMOG0 data set.
3. Create analysis data sets based on the CRT data sets: (i) analysis data sets extracted from CRT data sets, such as DEMOG1 extracted from DEMOG0, contain only row group identifiers, column group identifiers and outcome variables;
(ii) all statistics and data reconstruction are applied to the extracted data set, such as DEMOG1, and a final report data set is generated, such as DemogFinal (DEMOG2).
4. Generate the report.

2.3.2 Methodologies

Two types of methodologies are available to generate tables, listings and figures quickly and efficiently:

1. Format-driven
2. Template-driven

2.3.2.1 The Template-Driven Approach

[Figure: diagram of the template-driven mechanism. Starting from the statistician's SPECS and a template (ASCII), the SAS programmer gets the CRT data sets (e.g., Demog0), creates the analysis data set (AnalyDs1), creates the final report data set (FinalDs), and runs the reporting macro (SAS/MACRO with SAS/ODS and Excel) to produce the TLF output in PDF/RTF.]

2.3.2.2 The Format-Driven Approach

The FDA requires submission of all formats used. Presenting all the metadata in SAS formats not only helps the FDA reviewer quickly find all the information you have provided by looking in the format folder, but also helps your company standardize data processing and e-publishing. Here is an example that creates a drug name format:

  proc catalog cat=library.newfmts kill;
  quit;

  proc format library=library.newfmts;

    value $drugnm
      'DRUGNM' = 'ALVCC';
  run;

  proc format library=library.newfmts fmtlib;
  run;

Within a SAS program, the macro call

  %set_drugnm(lib=library, fmtcat=newfmts, format=drugnm);

will automatically put the information into its proper place. The SAS format has some very nice features, such as OTHER and PICTURE, and Version 9 and upcoming versions will provide more options that save a lot of typing. We have been using SAS formats exclusively to input all metadata, such as comments, titles, footnotes and headers, and to change any discrepant data.

2.3.2.3 Comparison of the Template-Driven and Format-Driven Approaches

In general, the pros and cons of format-driven vs. template-driven are:

- Format-driven is more automatic and requires fewer modifications.
- Template-driven is better for making as-needed modifications over time.

Both have great advantages over any non-standardized programming approach. Using a template-driven or a format-driven approach collects all the value-added or marginal information that does not come from the SAS data set into one place: either a template (an ASCII file) or a format (in the FDA guidance item 10 format folder). The FDA reviewer will appreciate that any non-SAS-data-set information can easily be found, and your company can access or modify that information quickly.

2.3.3 Modification of Marginal Information

The difference between the format-driven and template-driven approaches is in how the marginal section is produced. In the template-driven approach, the life cycle starts with two metadata files: (1) the specs from the statistician, which include all user requirements, and (2) the template that defines and amends the final table's marginal sections. A project-free SAS macro can read both the final data set (the second analysis data set) and the template at the same time to create the table output.
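A toy sketch of such a project-free reporting step follows (the macro name, template layout, and column positions are hypothetical, not the paper's production code):

```sas
/* Toy sketch of a template-driven report: the ASCII template
   supplies the marginal lines (titles, headers), and the final
   data set supplies the interior section. */
%macro tbl_report(template=, finalds=);
  data _null_;
    file print notitles;
    /* first emit every line of the template as the table margins */
    do until (eof_t);
      infile "&template" truncover end=eof_t;
      input line $char132.;
      put line $char132.;
    end;
    /* then emit the interior section from the final data set */
    do until (eof_d);
      set &finalds end=eof_d;
      put @1 label_ $char40. @45 c1 $char30.;
    end;
    stop;
  run;
%mend tbl_report;
```

Because the macro knows nothing about the project, changing a title means editing the template file, never the program.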
A set of pre-defined SAS macros applies the statistical tests and the specs to create the second analysis data set from the first analysis data set. Another set of SAS macros creates the row, column, and outcome variables from the CRT data, derived from the specs.

In the format-driven approach, all the marginal information can be stored in one SAS format catalog file. To modify the final TLF, you only need to change the format in newfmts.sas. Re-create the format once, and the corresponding titles, footnotes and headers, as well as any other information you may want to insert into the final report, change immediately. Here is a format that has been generated for titles:

  value $f_title
    '1' = 'ABC Pharmaceutical, Inc.'
    '2' = 'DRUG A: ACD-mI PR-T Conjugate'
    '3' = 'Study Report: 01-12-99'
    '4' = 'Figure 9.&num_fig..&g'
    '5' = 'Distribution Curves for &testname'
    '6' = '&&bld&g.-4th Dose';

Here is how the titles are created from it:

  %macro titles;
    %do l = 1 %to &num_titles;
      %global f_title&l;
      %let f_title&l = %sysfunc(putc(&l, $f_title.));
      %if &l < 4 %then %do;
        title&l h=0.6 j=l ls=0.3 f=times "&&f_title&l";
      %end;
      %else %do;
        title&l h=0.6 ls=0.3 f=times "&&f_title&l";
      %end;
    %end;
    ods proclabel = "&f_title4";
  %mend titles;

You do not need to change any code in the program to change a title, footnote or header; you only need to change the format in the format library by rerunning newfmts.sas.

2.3.4 Modification of the Interior Section

We are working with the two-dimensional rectangular form of rows and columns, and the data intersections of tables can be summarized by a small set of models.
Here are some examples:

  MODEL 1: one column of categorical data
  MODEL 2: one column of continuous data
  MODEL 3: one column of mixed continuous and categorical data
  MODEL 4: multiple columns of categorical data
  MODEL 5: multiple columns of continuous data
  MODEL 6: multiple columns of mixed continuous and categorical data
  MODEL 7: multiple columns of categorical data with p-values
  MODEL 8: multiple columns of continuous data with p-values

There are only a few models you may need. For example, we only need the descriptive statistics n and pct for categorical data, and n, mean, median, S.D., min and max for continuous data. Sometimes we need to calculate p-values as well. These can all be generated by well-defined SAS macros, as long as we have the row group identifier variable, the column group identifier variable, and the outcome variable. The macros for one-column and multiple-column calculations differ, but the idea is the same. For example, the macro that calculates n and pct for one-column categorical data can be written as:

  %macro GetNpctOneCol(varnm=, i=, where_=, pct_y_n=, totalpat=);
    proc sql;
      create table &varnm._ as
        select &i as row_grp, count(distinct patid) as count
        from &indata &where_
        order by row_grp;
    quit;

    data out&i (keep=label_ pct row_grp);
      length label_ $50 pct $15;
      set &varnm._;
      label_ = "&varnm";
      %if %upcase(&pct_y_n) = NO %then %do;
        pct = ' ' || put(count,3.);
      %end;
      %else %do;
        pct_ = 100 * count / &totalpat;
        pct = put(count,3.) || '(' || put(pct_,4.1 -L) || ')';
      %end;
      row_grp = &i;
    run;
  %mend GetNpctOneCol;

The macro that calculates n, mean, median, std and range for one-column continuous data can be written as:

  %macro GetStatOneCol(varnm=, i=);

    proc univariate data=&indata noprint;
      var &varnm;
      output out=&varnm._out n=n mean=mean median=median std=std min=min max=max;
    run;

    data &varnm._out (drop=n_n mean_n median_n std_n);
      length range $30;
      set &varnm._out (rename=(n=n_n mean=mean_n median=median_n std=std_n));
      n      = put(n_n,6.);
      mean   = put(mean_n,7.1);
      median = put(median_n,7.1);
      std    = put(std_n,7.1);
      range  = '(' || trim(left(put(min,12.1))) || ',' || trim(left(put(max,12.1))) || ')';
      row_grp = &i;
    run;

    proc transpose data=&varnm._out out=&varnm._out2;
      by row_grp;
      var n mean median std range;
    run;

    data out&i (drop=_name_ col1);
      length label_ $50 c1 $30;
      set &varnm._out2;
      select (_name_);
        when ("n")      do; row_ord = 1; label_ = "N";        end;
        when ("mean")   do; row_ord = 2; label_ = "Mean";     end;
        when ("median") do; row_ord = 3; label_ = "Median";   end;
        when ("std")    do; row_ord = 4; label_ = "SD";       end;
        when ("range")  do; row_ord = 5; label_ = "Min, Max"; end;
        otherwise;
      end;
      pct = col1;
      c1 = pct;
    run;
  %mend GetStatOneCol;

For example, for MODEL 3 (one column of mixed continuous and categorical data), the table looks like this:

  ABC Corporation                                    Page 1 of 1
  Protocol No. VCC-001-99

                         TABLE 14.1.1
                 Summary of Subject Demographics

  Parameter                Statistic       Total
  Number of Subjects       n               120
  Gender
    Male                   n (%)           70 (58)
    Female                 n (%)           50 (42)
  Age (yrs)                n               120
                           Mean            18.2
                           Median          17
                           Std. Dev.       1.32
                           (Min., Max.)    (16, 19)
  Height (cm)              n               120
                           Mean            170
                           Median          172
                           Std. Dev.       3.41
                           (Min., Max.)    (152, 183)
  Weight (kg)              n               120
                           Mean            142
                           Median          151
                           Std. Dev.       5.2
                           (Min., Max.)    (80, 210)

The macro calls that create the final data set can be written as:

  %let indata=crt.demog0;
  %GetNpctOneCol(varnm=totalpat, i=1, where_=%str(where ittflag=1), pct_y_n=no, totalpat=120);
  %GetNpctOneCol(varnm=sex1, i=2, where_=%str(where sex=1), pct_y_n=, totalpat=120);
  %GetNpctOneCol(varnm=sex2, i=3, where_=%str(where sex=2),

    pct_y_n=, totalpat=120);
  %GetStatOneCol(varnm=age, i=9);
  %let indata=crt.wtht_1;
  %GetStatOneCol(varnm=ht, i=10);
  %GetStatOneCol(varnm=wt, i=11);

Creating the interior section of a table means creating the final data set. Once the final data set has been created, the last step of programming, the report generation, can also be standardized.

2.3.5 Report Generation in TLF Standardization

It is important to separate the final data set from the report section: we often need to change or QC final data sets, and to change the final report format and layout, separately. It is necessary to define a standard TLF layout. Here is an example of a TLF output standard.

Header:
- Row 1: company name (left justified); "Page x of y" (right justified).
- Row 2: protocol number.
- Row 3: blank.
- Row 4: table number.
- Row 5: blank.
- Row 6: title (centered; if too long, wrapped onto the next (seventh) line); otherwise row 7 is blank.
- Row 8: horizontal line.
- Row 9: blank.
- Row 10: start of the column headers (centered); rows 11 and onward continue the headers as needed.

After the headers comes the middle section, where the data are presented. Then comes the footer:
- Row 1: horizontal line.
- Row 2: blank, or footnotes.
- Row 3: program name (left justified) and creation date (right justified).

Total page size: 44 to 66 lines. Line size: 132 to 160 characters.

It is critical to set up options such as the page size. When a variable value is too wide to fit in a column, we can use the FLOW option in PROC REPORT to wrap it onto the next line. It is better to set a macro variable that lets the user adjust the PAGESIZE.

Another issue in the report section is PDF output. As the FDA requires, TLF outputs have to be in PDF format, and SAS/ODS can easily be used to produce PDF or RTF.
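A skeleton of this layout standard under ODS PDF might look as follows (the data set, variables, and literal header text are hypothetical illustrations, not the paper's production code):

```sas
/* Sketch of the standard TLF page layout: company/page line,
   protocol line, table number, centered title, a FLOW column for
   long text, and an ODS PROCLABEL bookmark; PS and LS stay in the
   standard 44-66 line / 132-160 character ranges. */
options ps=60 ls=132 nodate nonumber;
title1 j=l 'ABC Corporation' j=r 'Page 1 of 1';
title2 'Protocol No. VCC-001-99';
title3 ' ';
title4 'TABLE 14.1.1';
title5 ' ';
title6 'Summary of Subject Demographics';
footnote1 j=l 'Program: demog.sas' j=r "Created: &sysdate9";

ods pdf file='t14_1_1.pdf';
ods proclabel 'Table 14.1.1 Summary of Subject Demographics';
proc report data=final.demogfinal nowd headline split='|';
  column label_ c1;
  define label_ / display 'Parameter|Statistic' width=40 flow;
  define c1    / display 'Total'                width=20;
run;
ods pdf close;
```

Because the layout lives entirely in TITLE/FOOTNOTE statements and PROC REPORT options, the same skeleton is reused for every table, with only the final data set and the marginal text changing.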
We should pay attention to the following issues:

(1) Bookmarks. Getting one bookmark per output is easy if we use ODS PROCLABEL, and CONTENTS= can remove the second child bookmark. But adding more child bookmarks, or changing the font and size of bookmarks, is not an option in currently released SAS. You can still use PROC TEMPLATE to add and modify the default bookmarks, but you need to write a lot of code to do it.

(2) Font size changes in tables. Version 8.2 has an issue changing the font size when you use an ASCII file as the input file: the default font size magnification is 6.7, and you may need to set an ODS style element in PROC REPORT rather than reading the ASCII file with DATA _NULL_. You can also use SAS Version 9 to solve the font issue.

(3) Hyperlinks. SAS/ODS can create PDF directly, but to link TLF outputs back and forth you need to create a PostScript file and distill it to PDF, which is still much better than converting RTF to PDF. For links to other types of files, you can use http: and file:// URLs. This works very well when we create define.pdf with links to different pages, or links to SAS transport (XPT) files.

No matter how complicated the task, there is always the possibility that SAS DATA _NULL_ can manipulate PDF or PostScript code to create anything beyond the ordinary. There are some SAS shortcut tricks in the system development software that you should know about, but it is obviously impossible to show everything in such a short paper.

3. CONCLUSION

Standardizing data processing and e-publishing is necessary and feasible. We have been creating various tools and programs using SAS to standardize data processing and e-publishing, including SAS/ODS toolkits for e-publishing purposes. Each run may take only a few seconds to execute the standard procedure and produce the electronic submission documents, such as SAS transport (XPT) files and define.pdf, as well as the other required bookmarking and hyperlinking documents.
Standardization is the only choice for pharmaceutical companies that want to save time and increase efficiency in drug development. It is important to understand the basic idea and to do it right. Having customized toolkits ready is part of the standardization and globalization procedure, and the toolkits we use are created with SAS only. The software tools and programs generated with SAS to create a standardized data processing and e-publishing system are now available to the pharmaceutical industry, and I think the new versions released by SAS will offer even more advanced features that help us reduce a great deal of cost and time in new drug development.

ACKNOWLEDGEMENTS

I would like to acknowledge and thank John Green, John Wenston, Andy Siegel, Linda Barrett, Jennifer Angell, Kathleen Greene, Anthony Homer, Carlos Diaz, Qi Zhang and Min Zhang. Without their inspiration and help this paper could not have been written.

REFERENCES

FDA (1999a), Guidance for Industry: Providing Regulatory Submissions to the Center for Biologics Evaluation and Research (CBER) in Electronic Format - Biologics Marketing Applications, Food and Drug Administration, November 1999.
FDA (1999b), Providing Regulatory Submissions in Electronic Format - NDAs, Food and Drug Administration, January 1999.
FDA (2000), Application to Market a New Drug, Biologic, or an Antibiotic Drug for Human Use, Food and Drug Administration, April 2000.

SAS is a registered trademark of SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged.

Shawn Wang
MedXview, Inc.
124 Mt. Auburn Street
Cambridge, MA 02138
Work Phone: (617) 576-5855
Fax: (617) 661-8535
Email: shawnwang@medxview.com
Web: www.medxview.com