Prove QC Quality: Create SAS Datasets from RTF Files

Honghua Chen, OCKHAM, Cary, NC

ABSTRACT

Because collecting drug trial data is expensive and affects human life, the FDA and most pharmaceutical company SOPs require all datasets and TLFs to be checked by independent secondary QC programmers. Comparing hundreds or even thousands of pages of tables and listings is tedious and consumes a huge amount of QC programmers' time. This paper outlines a process flow that replaces the reporting setup calls in a primary SAS program with ODS RTF statements to create a temporary SAS program, executes that program to create a temporary RTF file [3], extracts data from the RTF file to create the primary final SAS dataset, and runs PROC COMPARE against the QC final SAS dataset. The listing produced by PROC COMPARE can be saved for auditing purposes. The process can improve QC performance, reduce validation processing and paperwork, and ultimately prove QC quality. The full program, a sample input RTF file, and the output dataset are available as appendices.

INTRODUCTION

Two types of RTF output exist in the pharmaceutical industry. The first involves converting SAS output files (.lst) to RTF and doing some post-processing. Various techniques for this type of output are available on the internet, such as out2rtf (search support.sas.com), created by David Ward in May 1999. Variations of that macro have played important roles in automating the post-processing of .lst files.

The second type is called in-text RTF and is created with the SAS ODS RTF destination. Medical writers prefer this two-dimensional table format because copying and pasting the table does not cause values to shift between columns. A SAS dataset is also a two-dimensional table, so extracting an RTF file to a SAS dataset seems like a logical choice.
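As a minimal sketch of that idea, the DATA step below pulls the text out of a single table-cell line. The sample line is an assumption made up for illustration (real ODS RTF output carries more control words); only the `{` and `\cell}` delimiters matter here, and they are the same markers the program in Appendix II searches for.

```sas
/* Illustrative only: the RTF line below is a hypothetical, simplified cell. */
data cell_demo;
  length f1 $200 text $100;
  f1 = '\pard\intbl\ql{101-001\cell}';

  pos1 = index(f1, '{');        /* start of the cell text  */
  pos2 = index(f1, '\cell}');   /* end-of-cell marker      */

  /* Keep the characters strictly between the two markers */
  if pos1 > 0 and pos2 > pos1 + 1 then
    text = substr(f1, pos1 + 1, pos2 - pos1 - 1);

  put text=;
run;
```

Each cell of the rendered table arrives as one such line, which is why a row-by-row scan is enough to rebuild the table as dataset columns.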
THE ORIGINAL PROC REPORT PROCESS

Most pharmaceutical reporting systems require a pre-process (%prtsetup) before the PROC REPORT step and a post-process (%pageprt) after it. The pre-process sets up the destination of the report, the font, the page layout, and so on, while the post-process adds page numbers and formats the report according to the company's standards. See the following:

    libname testdata "/u01/home/hchen/company/drugname/protocols";

    %prtsetup;

    proc report data=testdata.final nowd spacing=1 split='*' headline;
      columns ('--' subjid bthdt age sex newrace ethnic height weight);
      define subjid  / order width=10;
      define bthdt   / display 'Birth Date' width=10;
      define age     / display width=7 'Age*(years)';
      define sex     / display width=6 'Sex';
      define newrace / display width=9 flow 'Race';
      define ethnic  / display width=10 flow 'Ethnicity';
      define height  / width=6 'Height*(cm)';
      define weight  / width=6 'Weight*(kg)';
      title2 'Listing of demographics and baseline characteristics';
      title3 'Full Analysis Set';
    run;

    %pageprt;
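The two macro calls that bracket the report are the anchors for the substitution. The paper performs that substitution with an external Perl script; as a hedged alternative sketch, the same search and replace can be tried inside SAS with the PRXCHANGE function. The dataset names, the variable names, and the file name temp.rtf below are hypothetical and chosen only for illustration.

```sas
/* Hypothetical three-line "program" held in a dataset for demonstration */
data pgm_lines;
  length line $200;
  line = '%prtsetup;';                            output;
  line = 'proc report data=testdata.final nowd;'; output;
  line = '%pageprt;';                             output;
run;

/* Swap the wrapper macro calls for plain ODS RTF statements.           */
/* Single quotes keep the macro processor from resolving %prtsetup.     */
data pgm_modified;
  set pgm_lines;
  length newline $200;
  newline = prxchange('s/%prtsetup;/ods listing close; ods rtf file="temp.rtf";/', -1, line);
  newline = prxchange('s/%pageprt;/ods rtf close; ods listing;/', -1, newline);
  put newline=;
run;
```

Lines that match neither pattern pass through unchanged, so the PROC REPORT code itself is never touched.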
THE MODIFIED PROC REPORT PROCESS

Using a Perl search and replace [1], we replace the pre-process (%prtsetup) with OPTIONS and ODS RTF statements that set up a new temporary destination and the simplest format for the RTF file. We then replace the post-process (%pageprt) with ODS RTF CLOSE. See the following:

    %let pgm_folderx=/u01/home/hchen/company/drugname/protocols;
    %let slashx =/;

    libname testdata "/u01/home/hchen/company/drugname/protocols";

    ods listing close;
    options nodate nonumber ORIENTATION=LANDSCAPE device=sasemf;
    ods rtf file="&pgm_folderx.&slashx.l-dm-temp.rtf";

    /* the PROC REPORT step is unchanged from the original program */
    proc report ...

    ods rtf close;
    ods listing;

EXTRACTING RTF OUTPUT TO SAS DATASET

We do not need to understand RTF tags [2] or use an RTF parser [5] to do the job. It is enough to scan the text within the RTF file, eliminate all unrelated rows, and transform what remains into two datasets: the primary final dataset with variables col1 to coln, and a dataset holding the titles, column headers, and footnotes. See Appendix II (source code), from data rtf_temp1 to data rtf_temp4.

COMPLETING THE TASK

The QC program creates the QC final dataset containing variables col1 to coln and compares it to the primary final dataset using PROC COMPARE.

    proc compare base=primary_final compare=qc_final listall;
    run;

    proc compare base=primary_tit_colhd_ft compare=qc_tit_colhd_ft listall;
    run;

CONCLUSION

We discussed a method of producing a dataset from an RTF file. Although the program was created for one company, adapting it to work for others with little modification is an achievable goal. Former Chinese leader Deng Xiaoping [4] is best known for the quotation: "I don't care if it's a white cat or a black cat. It's a good cat as long as it catches mice." Here the quote can be read to mean that being creative, being more effective, and serving the objectives of the client matter more than following traditional practice.
REFERENCES

[1] Shuguang Zhang. Use Perl Regular Expressions in SAS.
[2] Sean M. Burke. The Universal Document Format: RTF Pocket Guide.
[3] Biogen Idec. SMART System User's Guide.
[4] Wikipedia. Deng Xiaoping.
[5] Duong Tran. %RTFparser.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Honghua Chen
OCKHAM
8000 Regency Parkway, Suite 360
Cary, North Carolina, 27518
Phone: 4439380592
Email: hchen@ockham.com
Web: www.ockham.com
ACKNOWLEDGMENTS

I would like to thank the Biogen Idec programming team in RTP, NC for their helpful suggestions and assistance in testing the program presented in this paper. I would also like to thank Adam Gilbert and Juliet Allen for their encouragement and assistance in reviewing the macro.

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

DISCLAIMER

All code contained in this paper is provided on an AS IS basis, without warranty. The author makes no representation or warranty, either express or implied, with respect to the programs, their quality, accuracy, or fitness for a specific purpose. Therefore, the author shall have no liability to you or any other person or entity with respect to any liability, loss, or damage caused or alleged to have been caused directly or indirectly by the programs provided in this paper. This includes, but is not limited to, interruption of service, loss of data, loss of profits, or consequential damages from the use of these programs.
APPENDIX I (RTF TABLE):

APPENDIX II (SOURCE CODE):

    %macro get_final(inpgm_folder=,pgm_folder=,pgm_name=,s_string=,
                     e_string=,debug=no);

    *** Output a simple Perl program to SAS dataset qc_ods_rtf ***;
    options NOQUOTELENMAX;
    data qc_ods_rtf;
      length pgm_code $200;
      pgm_code='open (INFILE,"' || "&inpgm_folder" || '/' || "&pgm_name" ||
               '.sas" ) or die "can not open the input file";';
      output;
      pgm_code='open (OUTFILE,">' || "&pgm_folder" || '/' || "&pgm_name" ||
               '-temp.sas" ) or die "can not open output file";';
      output;
      pgm_code='select (OUTFILE);'; output;
      pgm_code='while (<INFILE>){'; output;
      pgm_code='s/' || "&s_string" || '/ods listing close; options nodate nonumber' ||
               ' ORIENTATION=LANDSCAPE device=sasemf; ';
      output;
      pgm_code='ods rtf file="&pgm_folderx.' || '&slashx.' || "&pgm_name" ||
               '-temp.rtf";/g;';
      output;
      pgm_code='s/' || "&e_string" || '/ods rtf close; ods listing; /g;'; output;
      pgm_code='print; }'; output;
      pgm_code='close (INFILE);'; output;
      pgm_code='close (OUTFILE);'; output;
    run;

    *** Call BIOGEN SMART utility macro putpgm to write out perl program
        qc_ods_rtf.sas ***;
    %let exe_lib =&pgm_folder/;
    %include "/biostats/macros/smart/putpgm.sas";
    %putpgm(qc_ods_rtf);

    *** Execute the perl program qc_ods_rtf.sas to create a modified
        primary program &pgm_name.-temp.sas ***;
    x "perl &pgm_folder/qc_ods_rtf.sas";

    *** Prepend the %let statements that the modified program relies on ***;
    data temp000;
      length pgm_code $200;
      pgm_code='%let pgm_folderx=' || "&pgm_folder" || ';'; output;
      pgm_code='%let slashx =/;'; output;
    run;
    %putpgm(temp000);
    x "cat &pgm_folder./temp000.sas &pgm_folder./&pgm_name.-temp.sas > &pgm_folder./&pgm_name.-temp2.sas";

    *** Execute the modified primary program &pgm_name.-temp2.sas ***;
    %include "&pgm_folder./&pgm_name.-temp2.sas";

    *** Read in the rtf file &pgm_name.-temp.rtf created from
        &pgm_name.-temp.sas ***;
    data rtf_input;
      infile "&pgm_folder./&pgm_name.-temp.rtf" delimiter='00'x MISSOVER DSD
             lrecl=32767 firstobs=1;
      format f1 $500.;
      input f1 $ ;
    run;

    *** Extract text from cells ***;
    data rtf_temp1;
      set rtf_input;
      length text $1000;
      *** Keep all cells from the table ***;
      if ( index(f1,'\cell}') or index(f1,'\row}') or index(f1,'\trowd\') ) > 0 ;
      *** Delete column titles ***;
      if ( index(f1,'\b\') ) = 0;
      *** Find start position of text ***;
      pos1 = index(f1,'{');
      *** Find end position of text ***;
      pos2 = index(f1,'\cell}');
      lengthx = pos2 - pos1-1;
      if pos1 ^= 0 and pos2 ^= 0 then do;
        if pos2 = pos1 + 1 then text = ' ';
        else text = substr(f1,pos1+1, lengthx);
      end;
    run;

    *** Join column titles that are split over several \line rows ***;
    data rtf_input1;
      set rtf_input;
      length f1x $1000;
      retain delx 0 f1x '';
      if (index(f1,'\b\') > 0 and index(f1,'\line}') > 0) then do;
        f1x = trim(left(f1));
        delx = 1;
        delete;
      end;
      if delx = 1 and index(f1,'\line}') > 0 then do;
        f1x = trim(left(f1x)) || trim(left(f1));
        delete;
      end;
      if delx = 1 and index(f1,'\cell}') > 0 then do;
        f1x = trim(left(f1x)) || trim(left(f1));
        delx = 0;
        f1 = tranwrd(f1x,'{\line}',' ');
        f1 = tranwrd(f1,'\~',' ');
        f1x = ' ';
      end;
    run;

    data rtf_tit_foot1;
      set rtf_input1;
      length text $1000;
      retain group 0 ;
      if index(f1,'\bkmkend') > 0 then group = 2000;
      if index(f1,'\header') > 0 then group = 1000;
      if index(f1,'\footer') > 0 then group = 3000;
      group = group + 1;
      *** Keep all cells from the table ***;
      *** keep column titles ***;
      if ((index(f1,'\b\') and index(f1,'\cell}')) or index(f1,'\row}')
          or index(f1,'\trowd\') ) > 0 ;
      *** Find start position of text ***;
      pos1 = index(f1,'{');
      *** Find end position of text ***;
      pos2 = index(f1,'\cell}');
      lengthx = pos2 - pos1-1;
      if pos1 ^= 0 and pos2 ^= 0 then do;
        if pos2 = pos1 + 1 then text = ' ';
        else text = substr(f1,pos1+1, lengthx);
      end;
      if text = '' or index(text,' ') > 0 then delete;
    run;

    proc sort nodupkey;
      by group text;
    run;

    data rtf_tit_foot2;
      set rtf_input;
      length text $1000;
      retain keepx 0 group 3000 ;
      *** keep footnotes created by macro setft ***;
      if index(f1,' ') > 0 and index(f1,'\line}') > 0 then do;
        keepx = 1;
        group = 3000;
      end;
      if keepx = 1 then group = group + 1;
      if keepx = 1 and index(f1,'\cell}') > 0 then keepx = 0;
      text = tranwrd(f1,'{\line}',' ');
      text = tranwrd(text,'\~',' ');
      if keepx = 0 then delete;
      if keepx = 1 and group = 3001 then delete;
      if text = '' then delete;
    run;

    proc sort nodupkey;
      by group text;
    run;

    *** Find the maximum column number ***;
    data rtf_temp2(drop= tot) tot_temp(keep = grp tot);
      set rtf_temp1;
      retain grp 0 col -1;
      if index(f1,'\trowd\') > 0 then do;
        grp = grp + 1;
        col = -1;
      end;
      col = col + 1;
      output rtf_temp2;
      if index(f1,'\row') > 0 then do;
        tot = col;
        output tot_temp;
      end;
    run;

    data rtf_temp3;
      merge rtf_temp2(in=a) tot_temp(in=b);
      by grp;
      if a;
    run;

    proc sql noprint;
      select max(tot) into :max_tot from tot_temp;
    quit;
    %put ****&max_tot****;

    *** Delete titles and footnotes cells ***;
    data rtf_temp4;
      set rtf_temp3;
      if tot = &max_tot;
      if col = 0 or col = &max_tot then delete;
    run;

    *** create SAS dataset with variable col1 to coln ***;
    proc transpose data=rtf_temp4 out=rtf_temp5 prefix=col;
      by grp;
      id col;
      var text;
    run;

    *** Get the result ***;
    data primary_tit_colhd_ft;
      set rtf_tit_foot1 rtf_tit_foot2;
      if index(text,'source: ') > 0 then delete;
      keep text group;
    run;

    data primary_final;
      set rtf_temp5;
      *** delete rows if all cells are blank ***;
      length col $1000;
      col = %do i = 1 %to &max_tot - 2;
              trim(left(col&i)) ||
            %end;
            trim(left(col%eval(&max_tot - 1)));
      if col = '' then delete;
      drop grp _name_ col;
    run;

    %if %upcase(&debug)=NO %then %do;
      *** Delete perl program qc_ods_rtf.sas ***;
      x "rm &pgm_folder./qc_ods_rtf.sas";
      *** Delete the modified primary program &pgm_name.-temp.sas ***;
      x "rm &pgm_folder./&pgm_name.-temp.sas";
      x "rm &pgm_folder./&pgm_name.-temp2.sas";
      x "rm &pgm_folder./temp000.sas";
      *** Delete the rtf file &pgm_name.-temp.rtf ***;
      x "rm &pgm_folder./&pgm_name.-temp.rtf";
      proc datasets library=work memtype=data nolist nowarn;
        delete rtf_temp1 rtf_temp2 rtf_temp3 rtf_temp4 rtf_temp5 qc_ods_rtf
               tot_temp rtf_input rtf_input1 rtf_tit_foot1 rtf_tit_foot2 temp000;
      quit;
    %end;

    %mend get_final;

    *** Call get_final macro to create SAS dataset PRIMARY_FINAL ***;
    %include "/u01/home/hchen/macros/get_final.sas";
    %get_final(inpgm_folder=%str(/u01/home/hchen/company/drugname/protocols),
               pgm_folder=%str(/u01/home/hchen/company/drugname/protocols),
               pgm_name=l-dm,
               s_string=%nrstr(%prtsetup;),
               e_string=%nrstr(%pageprt;),debug=yes);

    * Use the backslash (\) character to escape any type of character ;
    * that might interfere with perl code. Use '%\*' if you want to ;
    * replace '%*' in SAS code;
    %get_final(inpgm_folder=%str(/u01/home/hchen/company/drugname/protocols),
               pgm_folder=%str(/u01/home/hchen/company/drugname/protocols),
               pgm_name=l-dm2,
               s_string=%nrstr(%\*prtsetup;),
               e_string=%nrstr(%\*pageprt;),debug=no);

    *** Begin of QC program ***;
    *** End of QC program ***;

    *** Map the QC variables onto col1-col8 in listing column order ***;
    data qc_final;
      length col1-col8 $1000;
      set final;
      col1 = subjid;
      col2 = brthdtc;
      col3 = agex;
      col4 = sex;
      col5 = racex;
      col6 = ethnic;
      col7 = htx;
      col8 = wtx;
      keep col1-col8;
    run;
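With qc_final built, the remaining step is the comparison itself. The abstract notes that the PROC COMPARE listing can be saved for auditing; the following is a minimal sketch of one way to do that, assuming the listing destination is open. The fileref audit, the file name, and the macro variable cmp_rc are hypothetical names introduced only for this illustration.

```sas
/* Hypothetical location for the audit listing; replace with a real path */
filename audit "qc_compare_l-dm.lst";

ods listing;

/* Route procedure output to the audit file, overwriting any old copy */
proc printto print=audit new;
run;

proc compare base=primary_final compare=qc_final listall;
run;
/* Capture SYSINFO immediately after the step; 0 means no differences */
%let cmp_rc = &sysinfo;

/* Restore the default print destination */
proc printto;
run;

%put NOTE: PROC COMPARE return code for l-dm = &cmp_rc;
```

Keeping the return code alongside the saved listing gives reviewers a quick pass or fail signal without opening the file.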
APPENDIX III (SAS DATASETS):