A Useful Macro for Converting SAS Data sets into SAS Transport Files in Electronic Submissions


Paper FC07

Xingshu Zhu and Shuping Zhang
Merck Research Laboratories, Merck & Co., Inc., Blue Bell, PA 19422

ABSTRACT

In 1999, the FDA issued a guidance imposing certain requirements on electronic submissions. Under this guidance, an analysis data set must be converted into a SAS transport file. As a result, it is important for programmers to have a convenient and reliable tool to perform this conversion. We have developed such a tool in the form of a utility macro. This macro can also be used to split data sets when they exceed the FDA-imposed size limitation. We have tested this macro in many projects, and it has proved to be an efficient and advantageous tool in FDA submissions.

KEYWORDS

Data set size, PROC CONTENTS, SAS transport file, FDA submission

INTRODUCTION

The use of electronic submissions has become one of the fastest growing trends in the pharmaceutical industry. One very important part of the submission package is the analysis data sets. The FDA guidance on electronic submissions requires an analysis data set to be submitted in the form of a SAS transport file that may not exceed 25 MB in size ("Guidance for Industry - Providing Regulatory Submissions in Electronic Format - NDAs" (January 1999)). The guidance also requires that the transport file names include the three-character extension .xpt in order to be compatible with FDA desktop setup and usage. If a data set exceeds the FDA size limitation, it must be split, meaning that more than one transport file will be needed to present the data.

In order to comply with the FDA guidance and to accommodate FDA statistical reviewers, we have adopted some specific additional guidelines for preparing SAS transport files for electronic submission. By following these guidelines, we have been able to create user-friendly transport files that have been well accepted by the FDA. The additional guidelines are: SAS transport files must not exceed 62999 records (observations); when a large data set must be split into multiple transport files, the information should be divided in a specific manner and the resulting files should be named according to specific rules.

It is quite easy for SAS programmers to ignore or misunderstand the detailed requirements involved in converting data sets into SAS transport files. Under the incorrect assumption that all data sets automatically satisfy these requirements, programmers often convert data sets directly into transport files without checking for compliance with the limitations on data set size and number of records. Sometimes the data sets are simply inspected in Windows Explorer after they are created, and then the data set programs are used to fix the problem when the required limits are exceeded. This process can be very time consuming and inconvenient. Furthermore, since the format and size requirements for the data sets are very specific, it is important to follow a well-established procedure for splitting the files that exceed the size limits.

In this paper, we describe a reliable macro tool that we have developed in response to this situation. This macro automatically performs the conversions needed to make the data sets compatible with both the FDA guidance and the additional guidelines.

PROCEDURES

The tool that we developed for use in data set submissions is a utility macro called %SetXpt. This macro consists of four procedures carried out by four sub-macros, %DsSize, %Ds_obs, %sas2xpt and %splitds, as shown below.

   %macro SetXpt(ds=, xptdir=);
      %DsSize(ds=&ds);
      %Ds_obs(ds=&ds);
      /* ds_mb is a decimal value, so it is compared with %sysevalf; */
      /* the implicit %eval would compare it as text                 */
      %if %sysevalf(&ds_mb <= 25) and &n_obs <= 62999 %then %do;
         %sas2xpt(ds=&ds, xptdir=&xptdir);
      %end;
      %else %do;
         %splitds(ds=&ds, xptdir=&xptdir);
      %end;
   %mend SetXpt;

Each sub-macro is designed to carry out one specific task in the process of bringing a data set into compliance with both the FDA guidance and our additional guidelines. The first macro, %DsSize, estimates the size of the input data set, which is the first step in the conversion procedure. The second macro, %Ds_obs, determines the number of observations (records) in the data set. The third macro, %sas2xpt, converts the input SAS data set to an xpt file if the data set is within the required limits. The last macro, %splitds, which is also the most complicated one, performs the tedious and difficult work of splitting the data set. When it has been determined that the size or record requirements have been violated, this macro splits the data set according to the specific naming conventions and rules. It then converts the resulting SAS data sets to xpt files and exports them. Each of these macros is described in detail below.

(1). Calculating the size of a data set

The important point about data set size estimation is that it must be performed within a program before the final data set or transport file is created. Programmers should not try to make the size estimation after viewing the final data set or transport file in Windows Explorer.
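As an illustration of estimating the size inside a program, the file size can also be looked up in the SAS dictionary tables. The sketch below is not part of the authors' macro. It assumes SAS 9.3 or later (for the TRIMMED keyword) and that the FILESIZE column of DICTIONARY.TABLES is populated for the library engine in use. The data set WORK.DEMO and the macro variable chk_mb are hypothetical names chosen for the example.

```sas
/* Hypothetical example data set, built from SASHELP.CLASS */
data work.demo;
   set sashelp.class;
run;

/* Initialise the result so it resolves even if no row is selected */
%let chk_mb = ;

/* Look up the physical file size (bytes) and convert it to MB */
proc sql noprint;
   select filesize / 1048576 format=12.4
      into :chk_mb trimmed
      from dictionary.tables
      where libname = 'WORK'
        and memname = 'DEMO'
        and memtype = 'DATA';
quit;

%put NOTE: WORK.DEMO occupies &chk_mb MB on disk.;
```

A value obtained this way can be compared with the 25 MB limit in the same program that creates the data set, which is the point made above.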
The following SAS code demonstrates how the size of a data set is estimated from the library information produced by PROC CONTENTS. The input SAS data set is represented by the parameter ds. The macro %DsSize outputs a global macro variable, called ds_mb, that holds the size of the input SAS data set in megabytes (MB).

   %macro DsSize(ds= /* one or two level input SAS data set */);
      %global ds_mb;
      /* strip the libref from a two-level name */
      %if %index(&ds,%str(.)) ne 0 %then
         %let ds_ = %substr(&ds,%eval(%index(&ds,%str(.))+1));
      %else %let ds_ = &ds;
      ods output "Library Members"=LibInfo;
      proc contents data=&ds memtype=data;
      run;
      ods output close;
      data _null_;
         set LibInfo;
         if Memname eq "%upcase(&ds_)" then do;
            /* file size in bytes converted to MB */
            call symput('ds_mb', round(file_size/1048576,0.01));
         end;
      run;
   %mend DsSize;

(2). Estimating the number of observations in a SAS data set

The number of observations in a data set can be obtained with the SAS function ATTRN, as demonstrated by the macro %Ds_Obs. The macro %Ds_Obs outputs a global macro variable, n_obs, that holds the number of observations in the input SAS data set (ds).

   %macro Ds_Obs(ds= /* one or two level input SAS data set */);
      %global n_obs;
      %let dsid = %sysfunc(open(&ds));
      %let n_obs= %sysfunc(attrn(&dsid,nobs));
      %let dsid = %sysfunc(close(&dsid));
   %mend Ds_Obs;

(3). Converting a SAS data set to a SAS transport file when it is within the required size and record limits

There are normally two ways to convert a SAS data set into a SAS transport file: one is with the engine XPORT and the procedure PROC COPY, and the other is with the engine XPORT and the SAS data step. The macro %sas2xpt adopts the second approach. It uses the engine XPORT and the SAS data step to convert the SAS data set ds into the SAS transport file &xptname..xpt, which it stores under the directory xptdir.

   %macro sas2xpt(ds=, xptdir=);
      %local outfile;
      /* member name without libref, truncated to 8 characters */
      %if %index(&ds,.)>0 %then %let xptname=%substr(&ds,%eval(%index(&ds,.)+1));
      %else %let xptname=&ds;
      %let xptname=%substr(&xptname,1,%sysfunc(min(8,%length(&xptname))));
      %let outfile=%sysfunc(compress(&xptdir\&xptname..xpt));
      libname yyy xport "&outfile";
      data yyy.&xptname;
         set &ds;
      run;
      libname yyy clear;
   %mend sas2xpt;

(4). Splitting a SAS data set into multiple SAS transport files if it exceeds the required limits

If a data set surpasses the limits by being more than 25 MB in size or containing more than 62999 records, it must be split into smaller groups. The strategy employed here is that, when a data set is split, all of the data on a particular patient should be contained in one file, and, if possible, data on all patients from the same study site should be contained within the same file.
Since this strategy allows for better organization of the split data, it facilitates the work of FDA reviewers.

In addition, we have applied specific naming conventions as a method for organizing the split files. The files split from the same data set should use the same root name ending with a number that increases sequentially for each file. If the total number of split files is less than ten, then the root name for each file contains up to seven letters from the original data set file name and ends with a number from 1 through 9, respectively. Assume, for example, that the data set DEMODATA exceeds the size limit and is split into three data sets. The split data sets will be named DEMODAT1, DEMODAT2 and DEMODAT3. However, if the total number of split files is ten or greater, then the root name for each file contains up to six letters from the original data set file name and ends with a number from 01 to 99, respectively. (We assume that the number of split files will not exceed 99.) For example, assume that the data set LABADATA exceeds the size limit and is split into twelve data sets. The resulting split data sets will be named LABADA01, LABADA02, ..., LABADA10, LABADA11 and LABADA12.

Finally, after a data set is split into multiple sub-data sets with appropriate root names, these sub-data sets must be converted into xpt files. The entire process of splitting a SAS data set into multiple SAS transport files is a complicated one. In order to complete this process, we have developed the following SAS code, which is divided into five steps.

Step 1. Estimate the number of observations per output data set.

In order to estimate the number of observations per output data set, PROC CONTENTS is used to obtain the Engine/Host information about the input data set (ds). The SAS code for this step estimates the number of observations per output data set from Data Set Page Size and Max Obs per Page. Obs in First Data Page and Max Obs per Page are used to make sure that the size of the output data set does not exceed the limit of 25 MB. The macro variable ObsPerDs holds the number of observations per output data set and is used in the later steps.
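To make the Step 1 estimate concrete, the sketch below evaluates the formula ObsPerDs = obspage1 + maxobspp*(floor(25*1024*1024/pagesize) - 1) for one set of made-up values. The three input numbers are hypothetical and are not taken from the paper. The final line, which caps the result at the 62999-record guideline, is an assumption added for illustration; the Step 1 code in the paper computes only the 25 MB bound.

```sas
data _null_;
   pagesize = 16384;   /* hypothetical "Data Set Page Size" in bytes  */
   maxobspp = 100;     /* hypothetical "Max Obs per Page"             */
   obspage1 = 80;      /* hypothetical "Obs in First Data Page"       */

   /* number of pages that fit in 25 MB: 26214400 / 16384 = 1600 */
   pages = floor(25*1024*1024/pagesize);

   /* first page holds obspage1 rows, every other page maxobspp rows: */
   /* 80 + 100*1599 = 159980                                          */
   obsperds = obspage1 + maxobspp*(pages - 1);

   /* assumption, not in the paper: also respect the 62999-record rule */
   capped = min(obsperds, 62999);

   put pages= obsperds= capped=;
run;
```

With these values the size bound alone would allow 159980 records per file, so a narrow data set could pass the 25 MB test while still exceeding the record guideline.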
   ods output "Engine/Host Information"=hostinfo;
   proc contents data=&ds;
   run;
   ods output close;
   data _null_;
      set hostinfo end=eof;
      retain pagesize maxobspp obspage1;
      if Label1 eq "Data Set Page Size"     then pagesize = cvalue1;
      if Label1 eq "Max Obs per Page"       then maxobspp = cvalue1;
      if Label1 eq "Obs in First Data Page" then obspage1 = cvalue1;
      if eof then do;
         /* rows on the first page plus full pages up to 25 MB */
         ObsPerDs = obspage1+maxobspp*(floor(25*1024*1024/pagesize)-1);
         call symput('obsperds',compress(put(obsperds,8.0)));
      end;
   run;

Step 2. Calculate the total number of observations in each study site.

The second step is to obtain the observation counts for each study site in order to judge whether the study site can be placed into one data set.

   proc sort data=&ds out=dscopy;
      by Study_Site Patient_ID;
   run;
   data SiteCount(keep=Study_Site Site_Count);
      set DsCopy;
      retain Site_Count;
      by Study_Site;
      if first.study_site then Site_Count=0;
      Site_Count + 1;
      if last.study_site then output;
   run;
   /* attach the site total to every observation */
   data DsCopy;
      merge DsCopy SiteCount;
      by Study_Site;
   run;

Step 3. Assign Set_ID to each observation in the data set.

The variable Set_ID is used to label each observation in the input data set in order to determine in which output data set the observation will be stored. The macro variable tot_ds holds the total number of output data sets to be created after the input data set is split. For example, if an input data set is split into three output data sets, then the variable Set_ID will be given the values of 1, 2 and 3, respectively, and tot_ds will be 3.

   data SetID(keep=Study_Site Patient_ID Set_ID);
      set DsCopy end=eof;
      by Study_Site Patient_ID;
      retain _IDCnt _SetCnt 0 Set_ID 1;
      if first.study_site then do;
         /* start a new file if the whole site fits in one file */
         /* but not in what is left of the current one          */
         if Site_Count <= &ObsPerDs and _SetCnt + Site_Count > &ObsPerDs then do;
            Set_ID + 1;
            _SetCnt=0;
         end;
      end;
      if first.patient_id then _IDCnt=0;
      _IDCnt+1;
      if last.patient_id then do;
         _SetCnt + _IDCnt;
         /* a patient is never divided between two files */
         if _SetCnt > &ObsPerDs then do;
            Set_ID + 1;
            _SetCnt=_IDCnt;
         end;
         output;
      end;
      if eof then do;
         call symput("tot_ds",trim(left(put(set_id,3.))));
      end;
   run;

Step 4. Determine the root names of the multiple output data sets.

In accordance with the naming conventions, this step determines the root names of the multiple output data sets that are created after the input data set is split. As previously explained, if the total number of split data sets is less than ten, then the root name for each split data set consists of up to seven letters of the file name of the input data set, ending with a number from 1 through 9, respectively. If the total number of split data sets is between 10 and 99, then the root name for each split data set consists of up to six letters of the file name of the input data set, ending with a number from 01 through 99, respectively.

   %let rt=%scan(%substr(&ds,%eval(%index(&ds,.)+1)),1,.);
   %let root=%substr(&rt,1,%sysfunc(min(%length(&rt),%eval(8-%length(&tot_ds)))));

Step 5. Split the input data set (ds) into multiple data sets and convert them to xpt files.

The following code splits the large data set into multiple sub-data sets according to the variable Set_ID, which was calculated above in Step 3. A %DO loop iterates until the total number of split data sets, tot_ds, is reached. As each sub-data set is created, the macro %sas2xpt is called to write the corresponding xpt file.

   %do j=1 %to &tot_ds;
      /* pad single digits with a leading zero when there are 10 or more files */
      %if (&tot_ds >= 10) and (&j < 10) %then
         %let xptname=%sysfunc(compress(&root.0&j));
      %else %let xptname=%sysfunc(compress(&root.&j ));
      data &xptname(drop=site_count Set_ID);
         merge DsCopy SetID;
         by Study_Site Patient_ID;
         if Set_ID=&j;
      run;
      %sas2xpt(ds=&xptname, xptdir=&xptdir);
      proc datasets nolist;
         delete &xptname;
      quit;
   %end;
   /* remove the work data sets created in Steps 2 and 3 */
   proc datasets nolist;
      delete DsCopy SiteCount SetID;
   quit;

APPLICATION

We used an actual SAS data set, "vital.sas7bdat," to test the macro %SetXpt in the Windows NT environment. Since we determined the size of this SAS data set to be around 134 MB, which exceeds the limit of 25 MB, it was split into several sub-data sets. These sub-data sets were ultimately converted into xpt transport files. The macro was run on the data set vital.sas7bdat as follows:

   libname datadir "C:\data_analysis";
   %SetXpt(ds=datadir.vital, xptdir=vital);

The resulting files were viewed in Windows Explorer (the screenshot is not reproduced here). The data set vital.sas7bdat was split into six subsets, named vital1.xpt, vital2.xpt, ..., vital6.xpt, respectively. Each of these files is smaller than the maximum of 25 MB because we followed the strategy of keeping the same study site and patient in the same file. Macro %SetXpt has also been tested on many larger data sets, including lab data that has been split into up to 26 subsets. The results of these tests indicate that the macro works reliably for all different types of large data sets.
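The file names reported in this section follow from the naming rule of Steps 4 and 5, which can be reproduced in isolation. The macro below is a minimal sketch for illustration only. The name %split_names and its parameters are hypothetical, it takes a one-level data set name, and it writes the names to the log instead of creating any files.

```sas
%macro split_names(ds=, tot_ds=);
   %local root j name;
   /* keep up to 7 letters for fewer than 10 files, up to 6 otherwise */
   %let root=%substr(&ds,1,%sysfunc(min(%length(&ds),%eval(8-%length(&tot_ds)))));
   %do j=1 %to &tot_ds;
      /* pad single digits with a zero when there are 10 or more files */
      %if &tot_ds >= 10 and &j < 10 %then %let name=&root.0&j;
      %else %let name=&root.&j;
      %put &name;
   %end;
%mend split_names;

/* writes vital1 through vital6 to the log */
%split_names(ds=vital, tot_ds=6)

/* writes labada01 through labada12 to the log */
%split_names(ds=labadata, tot_ds=12)
```

Because the suffix length is taken from the number of digits in tot_ds, every generated name stays within the eight-character limit that %sas2xpt applies to transport file member names.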

CONCLUSION

Accurately and efficiently estimating the size of a data set and converting it into a SAS transport file is an important issue to resolve prior to an FDA submission. If a data set exceeds the FDA size limitation, it must be split into two or more sub-data sets. In this paper, we presented a utility macro that not only converts a SAS data set into a transport file, but also estimates the size and number of records of the data set and determines whether the limits have been exceeded. If necessary, this macro splits the data set and automatically assigns names to the multiple files it creates, according to specific strategies and naming conventions. This macro has been tested on SAS data sets in many projects for FDA submission and has proved to work accurately and efficiently, thus providing another valuable convenience for use in electronic submissions.

REFERENCES

SAS Macro Language Reference, First Edition. Copyright 1997 by SAS Institute Inc., Cary, NC, USA.

ACKNOWLEDGEMENTS

The authors would like to thank Donna Usavage, Allan Glaser and Jodi Benjamin for their valuable suggestions and comments.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Author Name: Xingshu Zhu
Company: Merck & Co., Inc.
Address: UNA-102, 785 Jolly Road, Blue Bell, PA 19422
Work phone: 484 344 3572
Fax: 484 344 7105
Email: xingshu_zhu@merck.com

Author Name: Shuping Zhang
Company: Merck & Co., Inc.
Address: 10 Sentry Parkway, Blue Bell, PA 19422
Work phone: 484 344 3496
Fax: 484 344 7105
Email: Shuping_zhang@merck.com