Reducing Statisticians' Programming Load: Automated Statistical Analysis with SAS and XML

Michael C. Palmer, Zurich Biostatistics, Inc., Morristown, NJ
Cecilia A. Hale, Zurich Biostatistics, Inc., Morristown, NJ

ABSTRACT

Statisticians often spend more time programming and supervising the programming for tables than they spend on the statistical analyses reported in the tables. Features new in versions 7 and 8 of SAS, coupled with XML, make it possible to reduce this programming load by automating analysis steps. In the automated system, a statistician prepares table shells and database maps with publishing software that saves both as XML files. SAS programs interpret the XML, retrieve data, and combine the interpreted XML and retrieved data to build XML files. Publishing software renders the XML files to PDF or other formats. The result is that clinical data pass from an analysis data set into a report-quality document without a statistician or programmer writing or setting up table programs or macros. As table content is revised, the system can revise the report-quality tables. Table shells are reusable within a report and across reports.

INTRODUCTION

The first step in implementing a statistical analysis plan for a clinical trial is purely statistical. Questions of clinical interest are translated into statistical hypotheses, the data are analyzed, and, from the statistical results, inferences are made in the clinical realm about drug safety and efficacy. The second step involves presenting the results in a way that supports the inferences. Statisticians often find that the programming, or supervision of programmers, for results presentation such as statistical tables takes a burdensome amount of time. The availability of new software technology, coupled with new features in versions 7 and 8 of the SAS system, makes it possible to automate some steps in statistical table production and, as a consequence, to reduce the programming load.
The new technology is XML (Extensible Markup Language). XML is a standard for electronic documents and data exchange. It was developed by the World Wide Web Consortium (W3C) and released in February 1998. XML is non-proprietary and platform independent. SAS versions 7 and 8 support an experimental XML driver as part of the Output Delivery System (ODS), but the work reported here does not use that driver. Only stable, fully supported features of SAS software were used. The new features in versions 7 and 8 that enabled the work reported here include regular expression functions and a 32K length limit for character variables and format decode values.

DOCUMENTS REPLACE PROGRAMS

Statistical table programming can be largely replaced by two documents. One document is a table shell that contains style and composition information, but no content. The other document is a database map that connects the shell to specific records in a database of table content. These two documents are built with the familiar graphical user interface in commercial publishing software. The result is a classic automation: the labor-intensive programming tasks traditionally required to build statistical tables are replaced by a quicker and cheaper document interface, and a higher-quality, fully word-processed table is produced.

Figure 1. Table shell

Figure 1 shows a typical table shell in the automated system. This shell has the style and composition of the final table, but instead of any table content, either text or numeric, there are row and column indices. Style features in the shell include increased font size and bolding for the 1,1 entry, center alignment for 1,1 and 2,1, shading for 2,1, and italics and reduced font size for 9,1. Composition features in the shell include borders, variable cell dimensions, and the varied number of columns for each row.

Figure 2. Database map

Figure 2 shows a fragment of a database map.
Each cell in the table shell has one or more corresponding entries in the map that point to data for the final table. The "DATASET" column in the map identifies a data set for each shell entry. Under the "CONDITION" column, "KEYS=" gives sort keys for the retrieved data in the final table, and "RULES=" gives subsetting criteria to apply to the data set to retrieve data.

These two documents contain all of the information needed to produce a table. Style and composition information is in the table shell and content look-up information is in the database map. With the legacy method of creating statistical tables, that is, programming, this information would exist in program code, difficult to locate and difficult to revise or reuse once located.

The segregation of style and composition from table content facilitates the revision and reuse of each. The table shell can be used as-is with a different database map, for the same study or a different study. Table content, including text such as titles, row headers, and column labels, can be revised by editing a database instead of modifying a program.
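The role of a database map entry can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' SAS implementation. The CONDITION syntax shown (`KEYS=... ; RULES=...` with equality rules only), the `MapEntry` class, and the variable names `param`, `stat`, `trt`, and `visit` are all assumptions made for illustration; the actual map syntax is the one shown in Figure 2.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """One hypothetical database-map entry for a single shell cell."""
    row: int
    col: int
    dataset: str   # the "DATASET" column
    keys: list     # sort keys taken from "KEYS="
    rules: dict    # subsetting criteria taken from "RULES="


def parse_condition(condition):
    """Split an assumed 'KEYS=a b; RULES=x=1 y=2' string into keys and rules."""
    keys, rules = [], {}
    for part in condition.split(";"):
        name, _, value = part.strip().partition("=")
        if name.upper() == "KEYS":
            keys = value.split()
        elif name.upper() == "RULES":
            for rule in value.split():
                var, _, wanted = rule.partition("=")
                rules[var] = int(wanted)
    return keys, rules


def retrieve(entry, database):
    """Apply the subsetting rules, then sort the selected records by the keys."""
    records = database[entry.dataset]
    selected = [r for r in records
                if all(r[var] == wanted for var, wanted in entry.rules.items())]
    return sorted(selected, key=lambda r: tuple(r[k] for k in entry.keys))


# Hypothetical table content: one result per record, identified by numeric keys.
database = {"efficacy": [
    {"param": 3, "stat": 1, "trt": 2, "visit": 1, "value": 5.1},
    {"param": 3, "stat": 1, "trt": 1, "visit": 1, "value": 4.7},
    {"param": 3, "stat": 2, "trt": 1, "visit": 1, "value": 0.9},
]}

keys, rules = parse_condition("KEYS=trt visit; RULES=param=3 stat=1")
entry = MapEntry(row=4, col=2, dataset="efficacy", keys=keys, rules=rules)
values = [r["value"] for r in retrieve(entry, database)]
```

In this sketch, pointing the same shell cell at different content means changing only the map entry; no retrieval code is edited, which is the separation the two documents are meant to provide.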
Figure 3. Screen capture of table shell in publishing software

The table shell and database map documents were built in Arbortext's Adept Publisher. Adept is a full-featured, commercial, off-the-shelf publishing package with a graphical user interface. As Figure 3 shows, Adept resembles Microsoft Word, except for the document map on the left side of the figure. The document map shows the markup language behind the document. Unlike Word, Adept can represent documents in XML and save the documents as XML. XML documents can be passed among applications for processing, between Adept and SAS in this case, and that capability guided the choice of publishing software.

The populated, publishable table is built by Zurich Biostatistics' proprietary SAS programs. The software reads, interprets, and implements the XML versions of the table shell and database map documents. The shell and map can be built before the table content database is populated.

SIMPLE PROGRAMMING REPLACES COMPLICATED PROGRAMMING

Figure 4. Table content database (panels: Data set 1, Data set 2, Data set 3, Data set 4)

The table content database has a simple structure that is uniform for all data sets in the database (Figure 4). Each record in the table content database has just one element of table content. Along with the table content, there is a set of unique keys and administrative variables such as a date and time stamp. Keys identify features of the statistical analysis, the variables analyzed, and as many other factors as are required to uniquely identify an element of content. The keys are used in the database map to select data for retrieval and to sort the retrieved data for the final table. Every statistical result that will appear in a table is in the database, and every text item in a table (titles, row labels, column headings, for instance) has an associated numeric index in the database. The indices are decoded to text when the table is created.
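The population step described above, in which each shell index is replaced by decoded content while the shell's style is left intact, might be sketched as follows. This is a Python illustration and not the proprietary SAS programs. The element names (`table`, `row`, `cell`), the `style` attribute, the `kind` field, and the `DECODE` lookup (standing in for a SAS format) are hypothetical, and the sketch assumes the database map has already resolved each cell index to exactly one content record.

```python
import xml.etree.ElementTree as ET

# Hypothetical shell markup: each cell holds its "row,column" index, not content.
SHELL_XML = """<table>
  <row><cell style="bold">1,1</cell></row>
  <row><cell>2,1</cell><cell>2,2</cell></row>
</table>"""

# Stand-in for a format that decodes numeric indices to text (assumed values).
DECODE = {101: "Table 1. Hypothetical title", 102: "Mean"}

# Records assumed to have been selected for each cell by the database map.
# Text items are stored as numeric indices; results are stored as numbers.
resolved = {
    "1,1": {"kind": "text", "value": 101},
    "2,1": {"kind": "text", "value": 102},
    "2,2": {"kind": "number", "value": 12.34},
}


def render(record):
    """Decode a text index to its text, or format a numeric result."""
    if record["kind"] == "text":
        return DECODE[record["value"]]
    return f"{record['value']:.1f}"


def populate(shell_xml, resolved_records):
    """Replace every cell's index with its content, keeping style attributes."""
    root = ET.fromstring(shell_xml)
    for cell in root.iter("cell"):
        cell.text = render(resolved_records[cell.text.strip()])
    return root


table = populate(SHELL_XML, resolved)
cells = [cell.text for cell in table.iter("cell")]
finished_xml = ET.tostring(table, encoding="unicode")
```

Because the style lives on the shell's markup and the text lives in the decode lookup, revising a title in this sketch means editing `DECODE`, not the shell or the population code.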
The table content database is populated as the final step in statistical analysis programs. To do this, the usual statistical analysis programs are written, and results that will be placed in a table are saved with an output data set from the relevant SAS PROC or from ODS. In versions 7 and 8, every PROC in SAS/STAT supports ODS. The output data sets are processed to assign keys and administrative variables to each result, and the results are output to the table content database. In the table content database, text exists in numeric variables. Each numeric value is decoded to text when the table is built. The existence of 32K format decodes in versions 7 and 8 makes this a practical way to handle text. This simple programming to dice up the output data sets is all of the programming needed to produce the final tables.

SOFTWARE-TO-SOFTWARE TALK REPLACES MEETINGS BETWEEN TABLE PROGRAMMERS AND STATISTICIANS

Publishing software, such as Adept, is specialized for the creation and maintenance of report-quality documents. SAS is specialized for statistical analysis, but is a clumsy tool for document creation. Statisticians' programming load can be reduced by choosing the best tool for each task in implementing a statistical analysis plan. The methods discussed in this paper make it possible to choose SAS for statistical analysis and publishing software for report creation and maintenance. For this to work, the two tools have to talk to each other, and XML is the language of this conversation. Zurich Biostatistics developed a set of programs written in SAS
that read the XML documents prepared in the publishing software, interpret them, implement them, and build styled, composed, populated tables in XML. The finished tables in XML are returned to the publishing software for rendering to paper or electronic formats such as Adobe's PDF. The FDA's recent guidance on electronic submissions recommends PDF for electronic documents submitted to the agency. These programs, along with the architecture of table shell, database map, table content database, and finished table, are termed Tekoa Technology(SM).

AUTOMATED WORD-PROCESSED FINAL PRODUCT REPLACES LINE-PRINTER OUTPUT

Figure 5. Published table as PDF in Adobe Acrobat Reader

REUSE OF DOCUMENTS REPLACES REWORK OF PROGRAMS

The separation of table style and composition in reusable table shell documents, on the one hand, and table content, including text, in data sets, on the other hand, eliminates the need to revise programs as tables are revised. Information for a specific table, either how it looks or its content, simply does not exist in program code. Revisions to table content, text or numeric, are edits of the table content database. The pre-existing table shell document is reused. Table shells can be reused within a project or across projects without modification or programming.

Tekoa Technology is a classic automation. It is cheaper and quicker than the legacy process of manual programming, and it also produces a higher-quality product. The word-processed statistical tables automatically produced with Tekoa Technology include the style and composition features available in the commercial publishing software. These are features such as proportional fonts, font sizes, shading, italics, superscripts, subscripts, vertical text alignment, horizontal text alignment, page x of y numbering, and page-specific footnotes.
Style features are either built into the table shell when it is constructed in the graphical user interface and automatically propagated to the finished table built by Tekoa's SAS programs, or added from a style sheet. There is no manual editing of the table.

CONCLUSION

SAS can work with publishing software and XML in an automated system to implement the statistical analysis plan for a clinical trial. The automated system uses documents instead of programming to apply the full range of publishing software style and composition features to statistical results. Compared to the legacy system of manual programming, the automated system produces a higher-quality product with less programming.
APPENDIX

Published table as paper output
REFERENCES

XML specification: www.w3.org/tr/rec-xml
Scientific American article on XML: www.scientificamerican.com/1999/0599issue/0599bosak.html

ACKNOWLEDGMENTS

SAS is a registered trademark of SAS Institute Inc. XML is a trademark of Massachusetts Institute of Technology. W3C is a registered trademark of the World Wide Web Consortium. Adept is a registered trademark of Arbortext, Inc. Word is a trademark of Microsoft Corporation. Adobe is a registered trademark of Adobe Systems, Inc. Tekoa Technology is a service mark of Zurich Biostatistics, Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Michael Palmer or Cecilia Hale
Zurich Biostatistics, Inc.
45 Park Place South PMB 178
Morristown, NJ 07960
Work Phone: 973-727-0025
Email: mcpalmer@zbi.net or cahale@zbi.net
Web: www.zbi.net