Lazy Programmers Write Self-Modifying Code OR Dealing with XML File Ordinals

Similar documents
Accessibility Features in the SAS Intelligence Platform Products

ABSTRACT MORE THAN SYNTAX ORGANIZE YOUR WORK THE SAS ENTERPRISE GUIDE PROJECT. Paper 50-30

XML in the DATA Step Michael Palmer, Zurich Biostatistics, Inc., Morristown, New Jersey

TLF Management Tools: SAS programs to help in managing large number of TLFs. Eduard Joseph Siquioco, PPD, Manila, Philippines

IMU8420 V1.03 Release Notes

This chapter is intended to take you through the basic steps of using the Visual Basic

Power Up SAS with XML Richard Foley, Anthony Friebel, SAS, Cary NC

Geocoding Crashes in Limbo Carol Martell and Daniel Levitt Highway Safety Research Center, Chapel Hill, NC

TIPS AND TRICKS: IMPROVE EFFICIENCY TO YOUR SAS PROGRAMMING

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

A Practical Introduction to SAS Data Integration Studio

Intro. Scheme Basics. scm> 5 5. scm>

Understanding JDF The Graphic Enterprise It s all about Connectivity

LST in Comparison Sanket Kale, Parexel International Inc., Durham, NC Sajin Johnny, Parexel International Inc., Durham, NC

Without further ado, let s go over and have a look at what I ve come up with.

SAS Data Integration Studio Take Control with Conditional & Looping Transformations

Use That SAP to Write Your Code Sandra Minjoe, Genentech, Inc., South San Francisco, CA

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

Workshop. Import Workshop

Taming a Spreadsheet Importation Monster

shortcut Tap into learning NOW! Visit for a complete list of Short Cuts. Your Short Cut to Knowledge

One SAS To Rule Them All

Teiid Designer User Guide 7.5.0

Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller

SAS Data Libraries. Definition CHAPTER 26

Basics of Stata, Statistics 220 Last modified December 10, 1999.

A Guided Tour Through the SAS Windowing Environment Casey Cantrell, Clarion Consulting, Los Angeles, CA

Step through Your DATA Step: Introducing the DATA Step Debugger in SAS Enterprise Guide

Paper HOW-06. Tricia Aanderud, And Data Inc, Raleigh, NC

Write SAS Code to Generate Another SAS Program A Dynamic Way to Get Your Data into SAS

PharmaSUG Paper TT11

CSCI S-Q Lecture #12 7/29/98 Data Structures and I/O

CIS 45, The Introduction. What is a database? What is data? What is information?

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

Developing SAS Studio Repositories

DSCI 325: Handout 2 Getting Data into SAS Spring 2017

PART COPYRIGHTED MATERIAL. Getting Started LEARN TO: Understand HTML, its uses, and related tools. Create HTML documents. Link HTML documents

New Program Update Sept 12, 2012 Build 187

Week - 01 Lecture - 04 Downloading and installing Python

Hypertext Markup Language, or HTML, is a markup

HTML for the SAS Programmer

Computer Science 2500 Computer Organization Rensselaer Polytechnic Institute Spring Topic Notes: C and Unix Overview

ODS TAGSETS - a Powerful Reporting Method

ODS/RTF Pagination Revisit

The Ins and Outs of %IF

Web API Lab. The next two deliverables you shall write yourself.

Best Practice for Creation and Maintenance of a SAS Infrastructure

Can you decipher the code? If you can, maybe you can break it. Jay Iyengar, Data Systems Consultants LLC, Oak Brook, IL

New Vs. Old Under the Hood with Procs CONTENTS and COMPARE Patricia Hettinger, SAS Professional, Oakbrook Terrace, IL

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

The name of our class will be Yo. Type that in where it says Class Name. Don t hit the OK button yet.

Using a Control Dataset to Manage Production Compiled Macro Library Curtis E. Reid, Bureau of Labor Statistics, Washington, DC

Monitoring Tool Made to Measure for SharePoint Admins. By Stacy Simpkins

WHAT IS THE CONFIGURATION TROUBLESHOOTER?

ABSTRACT: INTRODUCTION: WEB CRAWLER OVERVIEW: METHOD 1: WEB CRAWLER IN SAS DATA STEP CODE. Paper CC-17

SAS System Powers Web Measurement Solution at U S WEST

Math Dr. Miller - Constructing in Sketchpad (tm) - Due via by Friday, Mar. 18, 2016

Creating and Executing Stored Compiled DATA Step Programs

The Ugliest Data I ve Ever Met

CS2900 Introductory Programming with Python and C++ Kevin Squire LtCol Joel Young Fall 2007

What to Expect When You Need to Make a Data Delivery... Helpful Tips and Techniques

Mr G s Java Jive. #11: Formatting Numbers

Using a Fillable PDF together with SAS for Questionnaire Data Donald Evans, US Department of the Treasury

If Statements, For Loops, Functions

From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX

Using UNIX Shell Scripting to Enhance Your SAS Programming Experience

Program Validation: Logging the Log

Teiid Designer User Guide 7.7.0

Unlock SAS Code Automation with the Power of Macros

XP: Backup Your Important Files for Safety

ITNP43 HTML Practical 2 Links, Lists and Tables in HTML

Anyone Can Learn PROC TABULATE, v2.0

GEO 425: SPRING 2012 LAB 9: Introduction to Postgresql and SQL

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

Really quick guide to DocBook

Tutorial for downloading and analyzing data from the Atlantic Canada Opportunities Agency

Using SAS 9.4M5 and the Varchar Data Type to Manage Text Strings Exceeding 32kb

My SAS Grid Scheduler

Transitioning from Batch and Interactive SAS to SAS Enterprise Guide Brian Varney, Experis Business Analytics, Portage, MI

Improving Your Relationship with SAS Enterprise Guide Jennifer Bjurstrom, SAS Institute Inc.

Lecture 5. Essential skills for bioinformatics: Unix/Linux

Liquibase Version Control For Your Schema. Nathan Voxland April 3,

Reading How the Web Works

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

Chapter 7 File Access. Chapter Table of Contents

VARIABLES. Aim Understanding how computer programs store values, and how they are accessed and used in computer programs.

Optimizing System Performance

Don Hurst, Zymogenetics Sarmad Pirzada, Hybrid Data Systems

Use SAS/AF, SCL and MACRO to Build User-friendly Applications on UNIX

Utilizing SAS to Automate the Concatenation of Multiple Proc Report RTF Tables by Andrew Newcomer. Abstract

SAS Viya 3.1 FAQ for Processing UTF-8 Data

TMG Clerk. User Guide

FROM 4D WRITE TO 4D WRITE PRO INTRODUCTION. Presented by: Achim W. Peschke

EasyCatalog For Adobe InDesign

Intro to Programming. Unit 7. What is Programming? What is Programming? Intro to Programming

There are two (2) components that make up the Fuel Import Manager:

It s not the Yellow Brick Road but the SAS PC FILES SERVER will take you Down the LIBNAME PATH= to Using the 64-Bit Excel Workbooks.

How To Upload Your Newsletter

Paper AD SAS and Code Lists from Genericode Larry Hoyle, IPSR, University of Kansas, Lawrence, KS

Building a Data Warehouse with SAS Software in the Unix Environment

Transcription:

Paper 2454-2018 Lazy Programmers Write Self-Modifying Code OR Dealing with XML File Ordinals David B. Horvath, CCP ABSTRACT The XML engine within SAS is very powerful but it does convert every object into a SAS data set with generated keys to implement the parent/child relationships between these objects. Those keys (Ordinals in SAS speak) are guaranteed to be unique within a specific XML file. However, they restart at 1 with each file. When concatenating the individual tables together, those keys are no longer unique. We received an XML file with over 110 objects resulting in over 110 SAS data sets, which our internal customer wanted concatenated for multiple days. Rather than copying and pasting the code to handle this process 110+ times, and knowing that I would make mistakes along the way and knowing that the objects would also change along the way I created SAS code to create the SAS code to handle the XML. I consider myself a Lazy Programmer. As the classic "Real Programmers " sheet tells us, Real Programmers are Lazy. This session reviews XML (briefly), SAS XML Mapper, XML Engine, techniques for handing the Ordinals over multiple days, and finally discusses a technique for using SAS code to generate SAS code. INTRODUCTION XML, or extensible Markup Language is a method of creating files transferred between organizations. The primary advantage of using this form is that it is human readable and, more importantly, understandable. While other forms, like CSV are human readable, and to a certain extent, understandable when column headers are provided, they are a flat form. XML allows the creation of multiple levels of data with varying levels of cardinality. And it is certainly easier to deal with than the old method of creating a non-delimited fixed field size file. With a CSV, if you want to transfer account information that could have multiple owners who each have multiple phone numbers, you have to define all those fields from the beginning. With XML, each account will contain one to many owners who will have one to many phone numbers. If there is only one account owner with one phone number, only that data will be sent, not empty fields. While this is a nice feature it does make reading the data more difficult. An advantage and disadvantage is that you do not need to receive a record layout to begin reading and processing the data. The advantage is that you do not have to wait, the disadvantage is that this can lead to sloppiness and sudden, unexpected, changes (much like CSV). SOME BACKGROUND ON XML As I ve already mentioned, XML stands for extensible Markup Language it is a standard that was originally created in 1996 and consists of two main parts: Markup (the information about the data) and Content (the actual data). Markup is represented with tags; if you have ever looked at the source code for a web page, these will be familiar. If you re lucky, the vendor will provide you with a definition or schema, known as the XSD: XML Schema Definition. If you are not, you ll just have the XML file itself. But you can work from that, especially with the tools that SAS has provided. This also makes it easy for your data provider (internal or external to your organization) to change the layout: all they have to do is update the XSD and add the new data elements to the XML file. Or they can skip the XSD and just surprise you. 1

That flexibility can allow quick changes but if not communicated properly also to surprises. The best way to learn about XML is to look at it: <?xml version="1.0" encoding="utf-8" standalone="no"?><gpx xmlns="http://www.topografix.com/gpx/1/1" xmlns:gpxx="http://www.garmin.com/xmlschemas/gpxextensions/v3" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/trackpointextension/v2" creator="nã¼vi 2370" version="1.1" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xsi:schemalocation="http://www.topografix.com/gpx/1/1 http://www.topografix.com/gpx/1/1/gpx.xsd http://www.garmin.com/xmlschemas/trackpointextensionv2.xsd"><metadata><link href="http://www.garmin.com"><text>garmin International</text></link><time>2012-04-12T04:31:39Z</time></metadata><wpt lat="40.247249" lon="- 75.513001"><ele>28.72</ele><name>002</name><sym>Waypoint</sym></wpt><wpt lat="39.764033" lon="-75.551346"><ele>61.17</ele> But frankly, looking at the example from my portable navigation unit ( GPS ) isn t very helpful in this format. If you look at a raw XML file in a text editor, this is what you will see. Fortunately, web browsers will display XML for you with formatting and indenting that is not included in the original file. This helps to identify parent/child relationships. In this case, we re looking at GPX or GPS Exchange Format which is designed as a common data format for the exchange of waypoints (locations), routes (how to get somewhere), and tracks (how you actually got there). This makes it easy to process this kind of data in general. For this paper, it means I have easily available examples that do not violate the intellectual property and proprietary information of my employer. Error! Reference source not found. shows XML formatted in a much more readable, pretty format. 2

Figure 1. XML Viewed in a Web Browser For the remainder of the paper, we will reference this example: <?xml version="1.0" encoding="utf-8" standalone="no"?> <gpx xmlns="http://www.topografix.com/gpx/1/1" xmlns:gpxx="http://www.garmin.com/xmlschemas/gpxextensions/v3" xmlns:gpxtpx="http://www.garmin.com/xmlschemas/trackpointextension/v2" creator="nuvi 2370" version="1.1" xmlns:xsi="http://www.w3.org/2001/xmlschema-instance" xsi:schemalocation="http://www.topografix.com/gpx/1/1 http://www.topografix.com/gpx/1/1/gpx.xsd http://www.garmin.com/xmlschemas/gpxextensions/v3 http://www.garmin.com/xmlschemas/gpxextensionsv3.xsd http://www.garmin.com/xmlschemas/trackpointextension/v2 http://www.garmin.com/xmlschemas/trackpointextensionv2.xsd"> <metadata> <link href="http://www.garmin.com"> <text>garmin International</text> </link> <time>2012-04-12t04:31:39z</time> </metadata> 3

<wpt lat="40.224917" lon="-75.774154"> <name>031201 CAR</name> <sym>pin, Blue</sym> <extensions> <gpxx:waypointextension> <gpxx:categories> <gpxx:category>hand held</gpxx:category> </gpxx:categories> </gpxx:waypointextension> </extensions> </wpt> <trk> <name>active Log: 11 APR 2012 10:17</name> <trkseg> <trkpt lat="57.722606" lon="11.846760"> <ele>-18.61</ele> <time>2012-04-11t08:17:40z</time> <extensions> <gpxtpx:trackpointextension> <gpxtpx:course>358.59</gpxtpx:course> </gpxtpx:trackpointextension> </extensions> </trkpt> </trkseg> </trk> </gpx> Elements like <trk> (track) can have repeating sub-elements like <trkseg> (track segment) and <trkpt> (track point). A track point is a bread crumb along the way of your trip. Before I look at the data itself, I don t know how many tracks we will receive, how many segments each will have, nor how many bread crumbs were dropped along the way. I may not even know the data model for this information. Obviously, working with GPX, there are plenty of sources of information, so in this context the example is silly. But proprietary data from a vendor or less than cooperative internal data provider, you may not. The GPX XSD (XML Schema Definition) is available for download describing exactly what to expect in the XML data file. The purpose of this paper is not to teach you XML coding; there are plenty of sources for that starting with http://en.wikipedia.org/wiki/xml but having this background will help you when dealing with actual data. SAS XML MAPPING TOOL The SAS XML engine uses its own format for describing the contents of an XML file known as a map. The map goes beyond the definitions you might find in an XSD because it defines how you want to present the data in SAS datasets including how to connect between the various structures. To make your life easier, SAS provides a tool to create this map available as a free download from http://support.sas.com/kb/33/584.html The SAS XML Mapper can process the XML data file itself or an XSD to create the required map. When working with XML itself, it parses the file much like proc import. And, like proc import, it will look at a subset of a large file. As a result, any element it does not encounter, or if there are larger text fields later in the file, it may not cover all the elements correctly. Much like proc import, it also has to guess at the data type of elements. 4

A better approach is to use XML Mapper to process an XSD file when creating the map. The XSD provides a full definition reducing the guessing the tool has to perform. The only downside is that an XSD is not always available. A nice feature of the tool is that it creates its own keys to connect the elements known as ordinals. A child element will contain the ordinal of the parent. These ordinals are unique surrogate keys assigned during processing by the XML engine. You also edit the map (which is in a form of XML) yourself if you d like. For instance, adding additional ordinals allowing grandchildren to connect directly to the top level. In the Account/Owner/Phone number example, Account will contain its own ordinals, Owner will have both its and the Account ordinals, Phone will have both its and the Owner ordinals. But if you want to connect Phone to the account directory, you can manually edit the file and add them. The map itself is in XML format SAS XML ENGINE: CODING TO ACCESS XML DATA The coding within your SAS program to access XML data, once the map is created, is really very simple. First you need to define the files necessary: filename test "/export/home/myid/gpx.xml"; filename SXLEMAP "/export/home/myid/gpx.map"; libname test xml xmlmap=sxlemap access=readonly; The filename for your XML file and the libname should match ( test ) in this example. You also need to define the name of the map itself. Once that is complete, you can use the libname like any other: proc contents data=test._all_ ; Or read any of the member elements (treated like SAS datasets) almost like any other: data tableonly; set test.member END=EOF; output; Writing to an XML dataset (even without the access=readonly parameter) is a bit more difficult: ERROR: XMLMap= has been specified on the XML Libname assignment. The output produced via this option will change in upcoming releases. Correct the XML Libname(remove XMLMap= option) and resubmit. Output generation aborted. Much like most of the SAS capabilities, everything you ever wanted to know about the SAS XML engine is available at http://support.sas.com/rnd/base/xmlengine/index.html OUR PROBLEM The reason behind this discussion of how SAS handles XML files came about from an internal request to process a vendor produced XML file. As is often the case, there was limited documentation and a short timeline. The vendor had already been contracted and data delivery beginning shortly. As a result, we had limited time to prepare and needed a flexible solution. And, unfortunately, there was no internal experience with these tools. Complicating the situation was the contents of the XML: literally hundreds of internal objects (in my Account/Owner/Phone number, each of those is an object ). Each of those objects maps to one dataset. 5

While the map can be manipulated to combine multiple XML objects into one dataset, at this point we did not know enough about the data itself to be able to make those decisions. We needed to process the data into a usable form (datasets) to be able to see and learn about it. Ultimately, this was to be a daily file running through an enterprise scheduler ( hands off no manual intervention) producing a full history over time. DEALING WITH ORDINALS OVER MULTIPLE FILES The catch with the ordinal fields used to connect the various objects is that they are only unique to a specific XML data file, not over time. Each day, the first record in each object has the ordinal value of one. But in order to build history, I need uniqueness over time. In appending today s data to yesterday s to build my history, the Ordinals need to change. There really is a simple solution: find yesterday s maximum Ordinal value for each table (either from history or a saved table; history is easier to maintain), add that maximum to today s generated values, and append the resulting today table to the existing history. A better approach would be to perform the analysis to build exactly the records you need in the form you need. But you have to take the time to understand the data to perform the analysis. Time we did not have. Shifting back to the GPX example, we can look at the proc contents for two of the objects: gpx (the main parent) and rte (route, one of the children). Error! Reference source not found. shows the contents of the gpx Object including the Ordinal unique key. Variable Type Len Format Informat creator Char 32 $32. $32. extensions Char 32 $32. $32. gpx Char 32 $32. $32. gpx_ordinal Num 8 F8. F8. version Char 32 $32. $32. Table 1. gpx Object Error! Reference source not found. shows the contents of the rte (route) Object which includes its Ordinal unique key as well as that of the parent. Variable Type Len Format Informat cmt Char 32 $32. $32. desc Char 32 $32. $32. extensions Char 32 $32. $32. gpx_ordinal Num 8 F8. F8. name Char 32 $32. $32. number Num 8 F8. F8. rte Char 32 $32. $32. rte_ordinal Num 8 F8. F8. src Char 32 $32. $32. 6

type Char 32 $32. $32. Table 2. rte (route) Object These are only two of the total of 19 objects (XML pseudo Datasets) in a GPX file that would result in creating essentially the same code to perform the Ordinal manipulation 19 times. While it is minor pain when dealing with 19 sets, can you imagine doing that for the hundreds involved with our Vendor s XML? I can t. I d admit it: I m lazy and I make mistakes. I d really rather not copy & paste & edit the same code 19 (or 190) times. And I d rather not have to repeat the process every time the file changes (or a new element appears; remember that we had no XSD). Since the same process applies for every one of the elements, why not let code do the work for me? USING SAS CODE TO GENERATE SAS CODE The mechanism to create self-modifying code within SAS is rather simple since it is an interpreted language. You use File, Put, and %include: filename sourcecd gpxxml_generated&datadate..sas"; data _null_; file sourcec2; set temp.maxvals end=eof; /* use put statements */ %include sourcec2; The maxvals dataset contains the maximum ordinals from yesterday. I create a file that contains SAS code, and by including it, cause execution of that new code. In detail: filename sourcec2 "xml_generated&datadate._2.sas"; data _null_; file sourcec2; set temp.maxvals end=eof; if (_n_ = 1) then do; put "libname temp 'temp';"; end; put "data temp." tableonly "; set " member "END=EOF;"; if prikey NOT = "" AND prival NOT =. then put prikey" = " prikey " + " prival";"; if parkey NOT = "" AND parval NOT =. then put parkey" = " parkey " + " parval";"; put "output;"; put ""; put "proc datasets; append base=output." tableonly " data=temp." tableonly "; "; if EOF then do; put " "; end; 7

This code will generate the following code that is included back into this program. I also could have created a full program and executed with a new sas command. The resulting code snippet defined in sourcec2: /* Libname test and output defined in main program */ libname temp '/export/home/myid/temp'; data temp.author ; set test.author END=EOF; METADATA_ORDINAL = METADATA_ORDINAL + 257 ; output; proc datasets; append base=output.author data=temp.author ; data temp.bounds ; set test.bounds END=EOF; METADATA_ORDINAL = METADATA_ORDINAL + 257 ; output; proc datasets; append base=output.bounds data=temp.bounds ; This code is executed by including it back into the main program. Each day, the generated code is a little different because the maximum ordinals (in the example above, the value 257) changes each day. That way, the history contains unique ordinals over time. AFTERTHOUGHTS While I could have merged the maximum ordinal into each record or used macro values for that maximum ordinal, this approach means I am only processing each file the least number of times. This is important when dealing with over 200 objects from an XML file that occupies over 6 Gigabytes (on a daily basis). The fewer passes through the data, the better. We encountered issues dealing with the raw files as well. In one case we encountered a bad tag that broke the XML engine. In order to remove it we had to resort to UNIX tools (one line of awk code to strip out the tag). We did explore the possibility of editing the file within SAS as a plain text file, having data records that exceeded 88,000 bytes caused difficulties. Although _infile_ was automatically set, it was limited to 32,767 bytes. Trying to read each record into three 32K character fields (with the possibility that the maximum record length could grow) and finding that string that could cross those fields would have required rather complex coding. By going outside the box, processing the 6 Gigabyte file took mere minutes. One disadvantage of the XML engine is that each object we want to process (convert to a dataset) requires the full XML file to be parsed. As a result, we are parsing that same file about 200 times. Because of the elapsed time and computational intensity of the parsing, the code was changed to create 10 rsubmit packages to take advantage of the multiple CPU on our servers. Although this imposed a high I/O load on the server, it dramatically reduced the elapsed time required. Looking back, it might have been better to save off the maximum ordinal for each object at the end of each run rather than getting it from the history. My concern was that it would be difficult to fall back if a rerun was needed and the difficulty in adding new objects in the future. In order to dynamically determine the objects (the resulting datasets) and the variable names for ordinals, I used proc contents. I could have used the SQL dictionary tables but was not familiar with them at the time. 8

Although these examples were written with version 1 of the XML engine, we moved to version 2, the only code change that was required was changing the libname: libname test xmlv2 xmlmap=sxlemap access=readonly; Of course, we did have to change the map to the new syntax but the new XML Mapper took care of most of that effort. I also noticed that the set nobs= statement did not work with XML. It compiles (no warning or error) and executes, but returns the value zero. I learned that while working on a presentation for PhilaSUG on NOBS. CONCLUSION The XML engine is a powerful tool for parsing these files. But it comes at the cost of having to parse large files multiple times. Dynamic code helps to minimize the computational cost while enhancing flexibility.. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: David B. Horvath, CCP Phone: 1-610-859-8826 Email: dhorvath@cobs.com Web: http://www.cobs.com/ SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 9