Paper FC07

A Useful Macro for Converting SAS Data sets into SAS Transport Files in Electronic Submissions

Xingshu Zhu and Shuping Zhang
Merck Research Laboratories, Merck & Co., Inc., Blue Bell, PA 19422

ABSTRACT

In 1999, the FDA issued a guidance imposing certain requirements on electronic submissions. Under this guidance, an analysis data set must be converted into a SAS transport file. It is therefore important for programmers to have a convenient and reliable tool to perform this conversion. We have developed such a tool in the form of a utility macro. The macro can also be used to split data sets when they exceed the FDA-imposed size limitation. We have tested this macro in many projects, and it has proved to be an efficient and reliable tool in FDA submissions.

KEYWORDS

Data set size, PROC CONTENTS, SAS transport file, FDA submission

INTRODUCTION

The use of electronic submissions has become one of the fastest growing trends in the pharmaceutical industry. One very important part of the submission package is the analysis data sets. The FDA guidance on electronic submissions requires an analysis data set to be submitted in the form of a SAS transport file that may not exceed 25 MB in size ("Guidance for Industry - Providing Regulatory Submissions in Electronic Format - NDAs" (January 1999)). The guidance also requires that transport file names include the three-character extension .xpt in order to be compatible with FDA desktop setup and usage. If a data set exceeds the FDA size limitation, it must be split, meaning that more than one transport file will be needed to present the data.

In order to comply with the FDA guidance and to accommodate FDA statistical reviewers, we have adopted some specific additional guidelines for preparing SAS transport files for electronic submission. By following these guidelines, we have been able to create user-friendly transport files that have been well accepted by the FDA.
The additional guidelines are as follows: a SAS transport file must not contain more than 62999 records (observations); and when a large data set must be split into multiple transport files, the information should be divided in a specific manner and the resulting files should be named according to specific rules.

It is quite easy for SAS programmers to overlook or misunderstand the detailed requirements involved in converting data sets into SAS transport files. Under the incorrect assumption that all data sets automatically satisfy these requirements, programmers often convert data sets directly into transport files without checking for compliance with the limits on data set size and number of records. Sometimes the data sets are simply inspected in Windows Explorer after they are created, and the data set programs are then revised to fix the problem if the required limits are exceeded. This process can be very time consuming and inconvenient. Furthermore, since the format and size requirements for the data sets are very specific, it is important to follow a well-established procedure for splitting the files that exceed the size limits. In this paper, we describe a reliable macro tool that we have developed in response to this situation. This macro automatically performs the conversions needed to make the data sets compatible with both the FDA guidance and the additional guidelines.

PROCEDURES

The tool that we developed for use in data set submissions is a utility macro called %SetXpt. This macro consists of four procedures carried out by four sub-macros, %DsSize, %Ds_obs, %sas2xpt and %splitds, as shown below.

    %macro SetXpt(ds=, xptdir=);
        %DsSize(ds=&ds);
        %Ds_obs(ds=&ds);
        /* ds_mb is a decimal value, so it is compared with %sysevalf */
        %if %sysevalf(&ds_mb <= 25) and &n_obs <= 62999 %then
            %sas2xpt(ds=&ds, xptdir=&xptdir);
        %else
            %splitds(ds=&ds, xptdir=&xptdir);
    %mend SetXpt;

Each sub-macro is designed to carry out one specific task in the process of bringing a data set into compliance with both the FDA guidance and our additional guidelines. The first macro, %DsSize, estimates the size of the input data set, which is the first step in the conversion procedure. The second macro, %Ds_obs, obtains the number of observations (records) in the data set. The third macro, %sas2xpt, converts the input SAS data set to an xpt file if the data set is within the required limits. The last macro, %splitds, which is also the most complicated one, performs the tedious and difficult work of splitting the data set. When the size or record limit has been exceeded, this macro splits the data set according to the specific naming conventions and rules. It then converts the resulting SAS data sets to xpt files and exports them. Each of these macros is described in detail below.

(1). Calculating the size of a data set

The important point about data set size estimation is that it must be performed within a program, before the final data set or transport file is created. Programmers should not wait to estimate the size by viewing the final data set or transport file in Windows Explorer.
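The size test in %SetXpt compares ds_mb, which is generally a decimal value, against 25. In the SAS macro language such a comparison is safest inside %sysevalf, because %eval (and the implicit evaluation performed by %if) treats operands that contain a period as character strings. The following is only a minimal sketch of that behavior; the value 133.87 is hypothetical and is not taken from the authors' data.

```sas
/* Hypothetical size in MB, chosen only for illustration */
%let ds_mb = 133.87;

/* %eval sees a period and falls back to a character comparison, */
/* so "133.87" sorts before "25" and the test is reported true   */
%put eval:     %eval(&ds_mb <= 25);

/* %sysevalf performs a floating-point comparison instead */
%put sysevalf: %sysevalf(&ds_mb <= 25);
```

With this value one would expect the first line of the log to show 1 (true) and the second to show 0 (false), which is why a numeric evaluation is the safer choice for the 25 MB test.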
The following SAS code demonstrates how the size of a data set is estimated from the library information produced by PROC CONTENTS. The input SAS data set is represented by the parameter ds. The macro %DsSize outputs a global macro variable, ds_mb, that holds the size of the input SAS data set in megabytes (MB).

    %macro DsSize(ds= /* one or two level input SAS data set */);
        %global ds_mb;
        /* strip the libref, if present, to obtain the member name */
        %if %index(&ds,%str(.)) ne 0 %then
            %let ds_ = %substr(&ds,%eval(%index(&ds,%str(.))+1));
        %else %let ds_ = &ds;
        ods output "Library Members"=LibInfo;
        proc contents data=&ds memtype=data;
        run;
        ods output close;
        data _null_;
            set LibInfo;
            if Memname eq "%upcase(&ds_)" then do;
                /* convert bytes to MB, rounded to two decimals */
                call symput('ds_mb', round(file_size/1048576,0.01));
            end;
        run;
    %mend DsSize;

(2). Estimating the number of Observations in a SAS data set

The number of observations in a data set can be obtained simply with the SAS function ATTRN, as demonstrated by the macro %Ds_Obs. The macro %Ds_Obs outputs a global macro variable, n_obs, that holds the number of observations in the input SAS data set (ds).

    %macro Ds_Obs(ds= /* one or two level input SAS data set */);
        %global n_obs;
        %let dsid = %sysfunc(open(&ds));
        %let n_obs = %sysfunc(attrn(&dsid,nobs));
        %let dsid = %sysfunc(close(&dsid));
    %mend Ds_Obs;

(3). Converting a SAS data set to a SAS transport file when it is within the required size and record limits

There are normally two ways to convert a SAS data set into a SAS transport file: one uses the XPORT engine with PROC COPY, and the other uses the XPORT engine with a SAS data step. The macro %sas2xpt adopts the second approach. It uses the XPORT engine and a data step to convert the SAS data set ds into the SAS transport file &xptname..xpt, which it stores under the directory xptdir.

    %macro sas2xpt(ds=, xptdir=);
        %local outfile;
        /* member name without the libref, truncated to 8 characters */
        %if %index(&ds,.)>0 %then
            %let xptname=%substr(&ds,%eval(%index(&ds,.)+1));
        %else %let xptname=&ds;
        %let xptname=%substr(&xptname,1,%sysfunc(min(8,%length(&xptname))));
        %let outfile=%sysfunc(compress(&xptdir\&xptname..xpt));
        libname yyy xport "&outfile";
        data yyy.&xptname;
            set &ds;
        run;
        libname yyy clear;
    %mend sas2xpt;

(4). Splitting a SAS data set into multiple SAS transport files if it exceeds the required limits

If a data set exceeds the limits, either by being larger than 25 MB or by containing more than 62999 records, it must be split into smaller groups. The strategy employed here is that, when a data set is split, all of the data on a particular patient should be contained in one file, and, if possible, data on all patients from the same study site should be contained within the same file.
Since this strategy allows for better organization of the split data, it facilitates the work of FDA reviewers.

In addition, we have applied specific naming conventions as a method for organizing the split files. The files split from the same data set should use the same root name, ending with a number that increases sequentially for each file. If the total number of split files is less than ten, then the root name for each file contains up to seven letters from the original data set file name and ends with a number from 1 through 9. Assume, for example, that the data set DEMODATA exceeds the size limit and is split into three data sets. The split data sets will be named DEMODAT1, DEMODAT2 and DEMODAT3. However, if the total number of split files is ten or greater, then the root name for each file contains up to six letters from the original data set file name and ends with a number from 01 to 99. (We assume that the number of split files will not exceed 99.) For example, assume that the data set LABADATA exceeds the size limit and is split into twelve data sets. The resulting split data sets will be named LABADA01, LABADA02, ..., LABADA10, LABADA11 and LABADA12. Finally, after a data set is split into multiple sub-data sets with appropriate root names, these sub-data sets must be converted into xpt files.

The entire process of splitting a SAS data set into multiple SAS transport files is a complicated one. To carry it out, we have developed SAS code that is divided into the following five steps.

Step 1. Estimate the number of observations per output data set.

In order to estimate the number of observations per output data set, PROC CONTENTS is used to obtain the Engine/Host information about the input data set (ds). The SAS code for this step shows how the number of observations per output data set is estimated using Data Set Page Size and Max Obs per Page. Obs in First Data Page and Max Obs per Page are used to make sure that the size of the output data set does not exceed the limit of 25 MB. The macro variable ObsPerDs holds the number of observations per output data set and will be used later.
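To make the arithmetic behind ObsPerDs concrete, the short DATA step below applies the same formula to made-up values. The page size and per-page counts are hypothetical; real values come from the Engine/Host information reported by PROC CONTENTS for the data set at hand.

```sas
/* Hypothetical values; real ones come from PROC CONTENTS output */
data _null_;
    pagesize = 16384;   /* Data Set Page Size, in bytes */
    maxobspp = 100;     /* Max Obs per Page             */
    obspage1 = 80;      /* Obs in First Data Page       */

    /* number of whole pages that fit in 25 MB */
    pages = floor(25*1024*1024/pagesize);

    /* first page plus the remaining full pages */
    obsperds = obspage1 + maxobspp*(pages - 1);

    put pages= obsperds=;
run;
```

With these values the log should show pages=1600 and obsperds=159980. Note that this figure reflects only the 25 MB size limit; it is larger than the 62999-record guideline, which the formula as written does not take into account.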
    ods output "Engine/Host Information"=hostinfo;
    proc contents data=&ds;
    run;
    ods output close;
    data _null_;
        set hostinfo end=eof;
        retain pagesize maxobspp obspage1;
        if Label1 eq "Data Set Page Size" then pagesize = cvalue1;
        if Label1 eq "Max Obs per Page" then maxobspp = cvalue1;
        if Label1 eq "Obs in First Data Page" then obspage1 = cvalue1;
        if eof then do;
            ObsPerDs = obspage1+maxobspp*(floor(25*1024*1024/pagesize)-1);
            call symput('obsperds',compress(put(obsperds,8.0)));
        end;
    run;

Step 2. Calculate the total number of observations in each study site.

The second step is to obtain the observation count for each study site in order to judge whether the study site can be placed into one data set.

    proc sort data=&ds out=dscopy;
        by Study_Site Patient_ID;
    run;
    data SiteCount(keep=Study_Site Site_Count);
        set DsCopy;
        retain Site_Count;
        by Study_Site;
        if first.study_site then Site_Count=0;
        Site_Count + 1;
        if last.study_site then output;
    run;
    data DsCopy;
        merge DsCopy SiteCount;
        by Study_Site;
    run;

Step 3. Assign Set_ID to each observation in the data set.

The variable Set_ID labels each observation in the input data set in order to determine in which output data set the observation will be stored. The macro variable tot_ds holds the total number of output data sets to be created after the input data set is split. For example, if an input data set is split into three output data sets, then the variable Set_ID will take the values 1, 2 and 3, and tot_ds will be 3.

    data SetID(keep=Study_Site Patient_ID Set_ID);
        set DsCopy end=eof;
        by Study_Site Patient_ID;
        retain _IDCnt _SetCnt 0 Set_ID 1;
        /* start a new set if this site fits in one set but not in the current one */
        if first.study_site then do;
            if Site_Count <= &ObsPerDs and _SetCnt + Site_Count > &ObsPerDs then do;
                Set_ID + 1;
                _SetCnt=0;
            end;
        end;
        if first.patient_id then _IDCnt=0;
        _IDCnt+1;
        /* one output record per patient; a patient is never divided between sets */
        if last.patient_id then do;
            _SetCnt + _IDCnt;
            if _SetCnt > &ObsPerDs then do;
                Set_ID + 1;
                _SetCnt=_IDCnt;
            end;
            output;
        end;
        if eof then do;
            call symput("tot_ds",trim(left(put(set_id,3.))));
        end;
    run;

Step 4. Determine the root names of the multiple output data sets.

In accordance with the naming conventions, this step determines the root names of the multiple output data sets that are created after the input data set is split. As previously explained, if the total number of split data sets is less than ten, then the root name for each split data set consists of up to seven letters of the file name of the input data set, ending with a number from 1 through 9. If the total number of split data sets is between 10 and 99, then the root name for each split data set consists of up to six letters of the file name of the input data set, ending with a number from 01 through 99.

    %let rt=%scan(%substr(&ds,%eval(%index(&ds,.)+1)),1,.);
    %let root=%substr(&rt,1,%sysfunc(min(%length(&rt),%eval(8-%length(&tot_ds)))));

Step 5. Split the input data set (ds) into multiple data sets and convert them to xpt files.
The following code splits the large data set into multiple sub-data sets, as identified by the variable Set_ID that was calculated above in Step 3. A %DO loop iterates until the total number of split data sets, tot_ds, is reached. As each sub-data set is created, the macro %sas2xpt is called to write it out as an xpt file.

    %do j=1 %to &tot_ds;
        %if (&tot_ds >= 10) and (&j < 10) %then
            %let xptname=%sysfunc(compress(&root.0&j));
        %else %let xptname=%sysfunc(compress(&root.&j ));
        data &xptname(drop=site_count Set_ID);
            merge DsCopy SetID;
            by Study_Site Patient_ID;
            if Set_ID=&j;
        run;
        %sas2xpt(ds=&xptname, xptdir=&xptdir);
        proc datasets nolist;
            delete &xptname;
        quit;
    %end;
    proc datasets nolist;
        delete DsCopy SiteCount SetID;
    quit;

APPLICATION

We used an actual SAS data set, "vital.sas7bdat," to test the macro %SetXpt in the Windows NT environment. Since we determined the size of this SAS data set to be around 134 MB, which exceeds the limit of 25 MB, it was split into several sub-data sets. These sub-data sets were ultimately converted into xpt transport files. The macro was run on the data set vital.sas7bdat as follows:

    libname datadir "C:\data_analysis";
    %SetXpt(ds=datadir.vital, xptdir=vital);

The resulting files can then be seen in Windows Explorer (screenshot not reproduced here). The data set vital.sas7bdat has been split into six subsets, named vital1.xpt, vital2.xpt, ..., vital6.xpt. Each of these files is smaller than the maximum of 25 MB because we followed the strategy of keeping the same study site and patient in the same file. Macro %SetXpt has also been tested on many larger data sets, including lab data that was split into as many as 26 subsets. The results of these tests indicate that the macro works reliably for different types of large data sets.
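One simple way to confirm that a generated transport file is readable is to assign the XPORT engine to it and list its contents. The sketch below is not part of %SetXpt; the libref chk and the file path are assumptions made for illustration and should be replaced with the directory actually passed as xptdir.

```sas
/* Hypothetical path to one of the generated transport files */
libname chk xport "C:\data_analysis\vital\vital1.xpt";

/* List the member(s) and variables stored in the transport file */
proc contents data=chk._all_;
run;

libname chk clear;
```

Repeating this check for each split file gives a quick confirmation that the member names follow the intended naming convention before the files are submitted.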
CONCLUSION

Accurately and efficiently estimating the size of a data set and converting it into a SAS transport file is a very important issue to resolve prior to an FDA submission. If a data set exceeds the FDA size limitation, it must be split into two or more sub-data sets. In this paper, we presented a utility macro that not only converts a SAS data set into a transport file, but also estimates the size and number of records of the data set and determines whether the limits have been exceeded. If necessary, the macro splits the data set and automatically assigns names to the multiple files it creates, according to specific strategies and naming conventions. This macro has been tested on SAS data sets in many projects for FDA submission and has proved to work accurately and efficiently, thus providing another valuable convenience for use in electronic submissions.

REFERENCES

SAS Macro Language Reference, First Edition. Copyright 1997 by SAS Institute Inc., Cary, NC, USA.

ACKNOWLEDGEMENTS

The authors would like to thank Donna Usavage, Allan Glaser and Jodi Benjamin for their valuable suggestions and comments.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Author Name: Xingshu Zhu
Company: Merck & Co., Inc.
Address: UNA-102, 785 Jolly Road, Blue Bell, PA 19422
Work phone: 484 344 3572
Fax: 484 344 7105
Email: xingshu_zhu@merck.com

Author Name: Shuping Zhang
Company: Merck & Co., Inc.
Address: 10 Sentry Parkway, Blue Bell, PA 19422
Work phone: 484 344 3496
Fax: 484 344 7105
Email: Shuping_zhang@merck.com