Paper AD13

Create Metadata Documentation using ExcelXP

Christine Teng, Merck Research Labs, Merck & Co., Inc., Rahway, NJ

ABSTRACT

The purpose of the metadata documentation is two-fold. First, it facilitates quick understanding of the project design. Second, it rapidly validates that data sets and variables adhere to the electronic submission requirements for clinical trials.

SAS 9 provides several approaches to create Excel output. One of them is an experimental tagset called ExcelXP, which is available for download from the ODS Markup Resources site at http://support.sas.com/rnd/base/topics/odsmarkup/. The SAS 9 ExcelXP tagset generates XML output that conforms to the Microsoft XML Spreadsheet Specification ("XML Spreadsheet Reference", Microsoft Corp.). The XML output can be created on a UNIX or Windows platform and can be read by Excel 2000 and later releases.

In this paper, I use the ExcelXP tagset in conjunction with the SAS Dictionary to create metadata documentation for a group of data sets from a mock clinical trial project. A SAS macro is created based on the following requirements:

1. Create a project workbook that contains multiple worksheets.
2. Create a metadata table inside a worksheet for each data set.
3. If a given data set has a test code, create a second table that lists the test codes under the metadata table within the same worksheet.
4. Create a worksheet that comprises all variables within the project. In addition, identify all data sets that contain each variable.
5. In a separate worksheet, create a global dictionary of all test codes defined in the project, along with the associated test descriptions defined in PROC FORMAT.

SAS 9, Windows, Intermediate Level

Key Words: ExcelXP, Tagset, SAS Dictionary, PROC SQL

INTRODUCTION

The SAS 9 ExcelXP tagset generates XML output that conforms to the Microsoft XML Spreadsheet Specification ("XML Spreadsheet Reference", Microsoft Corp.).
It can create multiple worksheets in a workbook as well as multiple tables within a single worksheet. These features are very useful for creating metadata documentation in which each data set has its own labeled worksheet, which makes it quicker to locate the information for a group of data sets.

With SAS DICTIONARY and PROC SQL, the metadata documentation can be created without hard coding. The details of using PROC SQL and SAS DICTIONARY are not covered here. For more information, please refer to the SAS manuals or the paper I coauthored for PharmaSUG 2006 - Simple Ways to Use PROC SQL and SAS DICTIONARY TABLES to Verify Data Structure of the Electronic Submission Data Sets.

This paper is not a tutorial about the ExcelXP tagset. Rather, it demonstrates another application of the tagset. Detailed tutorials and references for the ExcelXP tagset are listed in the References section of this paper.

To control the appearance of the output within Excel, PROC TEMPLATE can be used to create a style template. A template defines how to format the output produced by a procedure or DATA step. For information about PROC TEMPLATE, please consult this site: http://support.sas.com/rnd/base/topics/odsmarkup/tagsets.html. SAS provides many standard templates that allow for customization. To see a list of the templates provided by SAS, (1) go to the Results window, (2) right-click on Results and select Template, (3) expand sashelp.tmplmst (see Table-1 in Appendix).

In the macro that builds the metadata documentation, I created a customized style template that applies certain fonts, colors and spacing inside my Excel workbook. This step is not required to use ExcelXP. However, a style template makes the output more presentable.

DESIGN REQUIREMENTS

The following are the requirements for the metadata documentation:
A. Create a macro program with two parameters:

   1. DATADIR is used to assign the input library name.
   2. DSETNAME is used to assign a list of data sets separated by +.

   The preferred design is that DATADIR is a required parameter. If no value is provided for DSETNAME, all data sets under the DATADIR directory should be used. Otherwise, only the data sets specified in the DSETNAME macro parameter are used. For this exercise, we use the data sets provided in the DSETNAME macro parameter.

   %ls_datastruc(datadir = datadir, dsetname = demog_mk+weighte_mk+vital_mk+labchem_mk+ms_mk)

B. Create a metadata table inside a worksheet for each data set defined in the macro parameters. The label of each data set should be listed first, followed by the attributes of the variables. (See Table-2 in Appendix)

C. If a data set contains an EXAM_CD field, create a second table after the metadata table in the same worksheet. (See Table-3 in Appendix)

D. After the worksheets for all data sets are created, create a worksheet that comprises all variables from the individual worksheets to build a global dictionary for the data sets that were specified. In addition, identify all data sets that contain each variable. (See Table-4 in Appendix)

   This worksheet is used to cross-reference all tables and allows one to quickly spot any inconsistencies. For example, in Table-4 the EXAMPARM variable appears twice with different attributes, which means that the variable was defined differently among programs. We need to go back and correct the definition of the variable if it should have the same attributes everywhere, or give it a new name if the difference is intentional.

E. Create a global dictionary of the test code (variable name is EXAM_CD) to list all EXAM_CD values defined in the data sets provided in the DSETNAME macro parameter. In addition, include the EXAM_CD description provided in PROC FORMAT. (See Table-5 in Appendix)

   Normally, we use the PROC FORMAT descriptions in table output, for example in a title or a test name.
   This worksheet checks whether each description is associated with the correct test code.

IMPLEMENTATION

Because the ExcelXP tagset is still evolving, it has some limitations, and its functionality may change in the future. It is recommended that users always download the latest update and verify the changes and enhancements.

To use the ExcelXP tagset, first download the latest version from the ODS MARKUP page. This page also provides links to documentation for using and customizing tagsets. For this exercise, I use the ExcelXP tagset version dated June 2006.

Before using the ExcelXP tagset, check the code or execute the following statement to see a list of the options available in the tagset:

    ODS tagsets.excelxp file = "test.xml" options(doc="help");

For the pre-configuration part of requirement A below, only the specifications are described, since the coding for this part is not the focus of this paper. The sections where the worksheets are built contain more detailed coding information.

REQUIREMENT A

Create a macro program with two parameters.

    %MACRO ls_datastruc(datadir=, dsetname=);

    *Pre-configuration before building the worksheets;

    NULLTBL   A table used to build the header in the global worksheets for requirements D and E.

    GLOBTBL   A table that contains all variables from the data sets in the DSETNAME list; for each variable it holds the list of tables that contain the variable. It is built from the DICTIONARY.COLUMNS table and is used in requirement D.
    TESTTBL   A table that contains all EXAM_CD values and the associated EXAM_CD short descriptions. The EXAM_CD values are collected from the individual data sets in the DSETNAME list. This is used in requirement E.

    FMTDESCP  A table created by loading the format with the PROC FORMAT CNTLOUT= option. It contains the full descriptions of the exam codes defined in PROC FORMAT. This is used in requirement E.

    EXAMLST   A macro variable that contains all data sets that have the variable EXAM_CD. This is used to build the sub-table for requirement C.

    *Set up the style template;
    proc template;
      define style styles.xlstatistical;
        parent = styles.statistical;
        :
        :

    *Set up the workbook;
    *Include the ExcelXP tagset code;
    ods listing close;
    ods tagsets.excelxp path  = 'c:\temp\excelxp'
                        file  = 'AD13.xml'
                        style = XLStatistical;

    *Build the worksheets (see requirements below);

    %MEND;

REQUIREMENT B

Create a metadata table inside a worksheet for each data set defined in the macro parameters.

    %let num  = 1;
    %let list = %upcase(%scan(&dsetname, &num, %str(+)));

    %*Use Do-While loop to create individual worksheet;
    %do %while (&list. ne );

      *Create worksheet with defined options;
      *(&_ODSDEST is set outside the code shown here, presumably to tagsets.excelxp);
      ods &_ODSDEST options(absolute_column_width = '6, 16, 6, 35, 35'
                            sheet_interval        = 'none'
                            sheet_name            = "&list");

      *Print data set name and label at the beginning of the sheet;
      select ' ',
             substr(memname,1)  as Data_Set,
             ' ',
             substr(memlabel,1) as Data_Set_Label,
             ' '                as Created_by
        from dictionary.tables
       where libname = "DATADIR"
         and memtype = "DATA"
         and memname = "&list";

      *Print data set columns and attributes information;
      select int(varnum)        as Pos,
             upcase(name)       as VarName,
             propcase(catx('',type,put(length, best4.))) as TypeLen,
             substr(label,1)    as Label,
             ' '                as Derivation_Comments
        from dictionary.columns
       where libname = "DATADIR"
         and memtype = "DATA"
         and memname = "&list"
       order by varnum;
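The attribute query above can also be tried on its own, outside the macro loop. The following is a minimal sketch rather than part of the original macro: it wraps the query in a complete PROC SQL step and, as an assumption made only for illustration, points it at SASHELP.CLASS (a sample data set shipped with SAS) instead of a study data set in DATADIR.

```sas
/* Minimal standalone sketch of the column-attribute query.            */
/* Assumption: SASHELP.CLASS stands in for a study data set; replace   */
/* the LIBNAME and MEMNAME values with your own library and data set.  */
proc sql;
  select int(varnum)   as Pos,
         upcase(name)  as VarName,
         /* Combine type and length into one cell, e.g. type then length */
         propcase(catx(' ', type, put(length, best4.))) as TypeLen,
         label         as Label
    from dictionary.columns
   where libname = 'SASHELP'      /* dictionary values are stored in upper case */
     and memtype = 'DATA'
     and memname = 'CLASS'
   order by varnum;
quit;
```

Because LIBNAME and MEMNAME are stored in upper case in the dictionary tables, the comparison values must be upper case as well, which is why the macro applies %upcase to each data set name.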

REQUIREMENT C

If a data set contains an EXAM_CD field, create a second table after the metadata table in the same worksheet.

      %if %index(&examlst., &list.) %then %do;
        select distinct
               exam_cd  label = 'Exam Code',
               examunit label = 'Unit',
               ' ',
               examparm as description,
               ' '      as Week
          from datadir.&list.;
      %end;

      %*Ready to build the next worksheet;
      %let num  = %eval(&num + 1);
      %let list = %upcase(%scan(&dsetname, &num, %str(+)));
    %end;

REQUIREMENT D

Create a worksheet that comprises all variables from the individual worksheets to build a global dictionary for the data sets that were specified. In addition, identify all data sets that contain each variable.

    ods &_ODSDEST options(absolute_column_width = '10, 6, 30, 85'
                          sheet_interval        = 'none'
                          sheet_name            = 'VarDictionary');

    *Create a header at the beginning of the worksheet;
    select ' ' label = 'Purpose: ',
           ' ',
           ' ' label = 'Reference for Variable Dictionary',
           ' '
      from NULLTBL;

    *Create variable dictionary and the tables that contain it;
    select VarName,
           TypeLen,
           Label,
           memnames label = 'In Data Sets'
      from GLOBTBL
     order by varname;

REQUIREMENT E

Create a global dictionary of the test code (variable name is EXAM_CD) to list all EXAM_CD values defined in the data sets provided in the DSETNAME macro parameter. In addition, include the EXAM_CD description provided in PROC FORMAT.

    ods &_ODSDEST options(absolute_column_width = '10, 25, 65'
                          sheet_interval        = 'none'
                          sheet_name            = 'StudyTests');

    *Create a header at the beginning of the worksheet;
    select ' ' label = 'Purpose: ',
           ' ' label = 'List of Tests Done'
      from NULLTBL;

    *Create exam_cd dictionary with description from PROC FORMAT;
    select distinct
           a.exam_cd     label = 'Exam Code',
           a.examparm    label = 'Parameter Name',
           b.description label = 'Format Description'
      from TESTTBL a
           left join FMTDESCP b
             on a.exam_cd = b.exam_cd
     order by a.exam_cd;

As shown above, I use only a few of the options provided by ExcelXP.
With the use of PROC SQL, SAS DICTIONARY tables and ExcelXP, I am able to quickly build up the workbook with multiple worksheets that contain the metadata information for a list of data sets. This information is very useful to help learn or verify a project database design.
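
The pre-configuration tables are described above by specification only. As one possible illustration, and not the code used in the macro, a table shaped like GLOBTBL could be derived from DICTIONARY.COLUMNS along the following lines. The intermediate table WORK.ALLVARS, the macro variable LIB, the column lengths, and the use of SASHELP as a stand-in library are all assumptions made so that the sketch runs as written; in the macro the library would be DATADIR and the rows would be restricted to the DSETNAME list.

```sas
/* Hypothetical sketch: one row per distinct variable definition, with  */
/* a comma-separated list of the data sets that contain it.             */
%let lib = SASHELP;   /* assumption: stand-in for the DATADIR library   */

proc sql;
  create table work.allvars as
  select upcase(name) as VarName length=32,
         propcase(catx(' ', type, put(length, best4.))) as TypeLen length=12,
         label        as Label,
         memname
    from dictionary.columns
   where libname = "&lib"
     and memtype = 'DATA'
   order by VarName, TypeLen, Label, memname;
quit;

data work.globtbl;
  length memnames $400;          /* assumed width for the data set list */
  retain memnames;
  set work.allvars;
  by VarName TypeLen Label;
  /* Start a new list for each distinct definition of a variable        */
  if first.Label then memnames = memname;
  else memnames = catx(', ', memnames, memname);
  /* Write one row once all data sets for this definition are collected */
  if last.Label then output;
  drop memname;
run;
```

Grouping by name, type/length and label together is a deliberate choice in this sketch: a variable that is defined differently in two data sets yields two rows, which is the kind of inconsistency the variable dictionary worksheet is meant to expose.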

SUMMARY

ExcelXP is one of the many tools in SAS for creating Excel output, and it requires only simple configuration to do so. Combined with the SAS Dictionary tables, I found it a very useful and simple way to create documentation for quality assurance purposes. Please visit the SAS support website at http://support.sas.com/rnd/base/topics/odsmarkup/ for additional ExcelXP tagset information and examples.

REFERENCES

DelGobbo, V. 2006. "Creating AND Importing Multi-Sheet Excel Workbooks the Easy Way with SAS". Proceedings of the Thirty-First Annual SAS Users Group International Conference, 31. CD-ROM. Paper 115.

Gebhart, E. 2005. "ODS Markup: The SAS Reports You've Always Dreamed Of". Proceedings of the Thirtieth Annual SAS Users Group International Conference, 30. CD-ROM. Paper 85.

Zender, C. 2005. "The Power of Table Templates and DATA _NULL_". Proceedings of the Thirtieth Annual SAS Users Group International Conference, 30. CD-ROM. Paper 88.

Teng, C. S. and Wang, W. 2006. "Simple Ways to Use PROC SQL and DICTIONARY TABLES to Verify Data Structure of the Electronic Submission Data Sets". PharmaSUG 2006.

SAS Macro Language: Reference

SAS SQL Procedure User's Guide

ACKNOWLEDGMENTS

The author would like to thank the management team for their encouragement and review of this paper.

TRADEMARKS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Christine Teng
Merck & Co., Inc.
Rahway, NJ 07065
christine_teng@merck.com

APPENDIX
Table 1 (Available Tagsets in SAS 9)

Table 2 (Requirement B)

Table 3 (Requirement C)

Table 4 (Requirement D)

Table 5 (Requirement E)