Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA

Overview

In whatever way you use SAS software, at some point you will have to deal with data. It is unavoidable. The trouble with data is that it is not always:

- Valid
- Correct

Fortunately, SAS has a myriad of tools to deal with such issues. One such tool is formats, which allow you either to view your data in another form or to transform it based upon pre-determined rules, i.e. those established when the format was created.

This paper will cover the basic uses of formats within SAS. Rather than approaching formats from a purely syntactical perspective, it will approach them through examples of their use. Syntax is easy to reference, but practical uses are harder to find. Formats that are part of SAS software itself will be briefly covered, followed by those created by the user. Emphasis will be given to both the creation and the use of formats.

What is a Format?

A format is a stored set of rules that can be used to restructure the cardinality of a column. This restructuring can take place either when viewing the data or when recoding the data. These rules can be determined either by the user (a user-defined format) or by SAS software itself.

This is a very simplistic definition of a tool that is extremely powerful and useful, but one that will suffice for an introduction to the subject. More useful than a definition of what a format actually is will be a description of what it can be used for. This paper will concentrate on the latter.

Data for Examples

For each of the examples, we will be using the same data. This will make following each example easier. The data is rich in formattable columns, so each of the techniques will be applicable.
The data will look as follows:

Column          Type  Description
Customer_num    C     Unique Customer id
Age             N     Customer Age
Gender          C     Customer Gender
Classification  C     Customer Type
birthdate       N     SAS date value
zip             C     Zip Code
phone           N     Phone Number
Orders          N     Number of Orders
value           N     $ Value of Orders

Table: Customer_data

The actual data values are predictable given the column descriptions, and are rich in potential for the main topics of this paper: the validation, recoding (transformation) and lookup of data.

Viewing Data In a Different Form (Using inherent SAS Formats)

This is one of the most elementary uses of formats. Data is stored in one way, but is required to be viewed in another. For example, in the customer_data file, the birth date is stored as a SAS date, but using SAS date and time formats it could be viewed or transformed into almost any form. If the customer was born on January 1, 1960, then because it is a SAS date, the stored value would be 0 (zero). If the birth month is required from this date, it is not necessary to work this out programmatically; a SAS format can be used instead.

Many formats are supplied with SAS software and are therefore available to anyone using the software. It is very worthwhile exploring the different categories of formats SAS supplies to understand the general ways in which they can be used, even if every single available format is not learned. Key categories (with format examples) are as follows (see the online documentation for a complete list):

Numeric Formats
- PERCENTw.d (writes numeric values as percentages)
- COMMAw.d (writes numeric values with commas separating every three digits)

Character Formats
- $UPCASEw. (converts character data to uppercase)
- $QUOTEw. (writes data values enclosed in double quotation marks)

Date and Time Formats
- DAYw. (writes date values as the day of the month)
- DOWNAMEw. (writes date values as the name of the day of the week)
- MONYYw. (writes date values in the form mmmyy or mmmyyyy)
- MONNAMEw. (writes date values as the name of the month)

The above list merely illustrates the types of formats that SAS supplies. The entire list is very long, but one that is worthwhile perusing.

SAS formats are used to write out existing data in another form. SAS also has the concept of an informat, which is the mirror image of a format. Instead of writing data out as a format does, an informat reads data in from a particular form.

Note the final example in the list above: monname. As the format name suggests, this format will write out the month name from a SAS date value. The following snippet of code illustrates how this can be used to display the month instead of the value 0 (zero).

    proc print data=customer_data noobs;
      var customer_num birthdate;
      format birthdate monname9.;
    run;

The output from the Print procedure will contain two columns: one containing the stored value of the customer_num column, and the other containing the 9-character month corresponding to the date the customer was born.

Note: most SAS procedures will, by default, operate on the formatted value of a column. Make sure that you read and understand how the formatted values are used before applying formats to columns in SAS procedures.
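The mirror-image relationship between informats and formats described above can be sketched in a short DATA step. This is an illustration only, not code from the paper: the data set name date_demo and the column month_name are made up for the example, while DATE9. and MONNAME9. are SAS-supplied.

```sas
/* Sketch only: DATE_DEMO and MONTH_NAME are hypothetical names. */
data date_demo;
  length month_name $9;
  /* An informat READS text into a stored value:          */
  /* the text 01JAN1960 becomes the SAS date value 0.     */
  birthdate = input('01JAN1960', date9.);
  /* A format WRITES a stored value in another form:      */
  /* the SAS date value becomes the name of its month.    */
  month_name = put(birthdate, monname9.);
  /* Write both values to the log for inspection. */
  put birthdate= month_name=;
run;
```

The stored value of birthdate remains the number 0 throughout; only the text returned by put() carries the month name.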
So, up to this point, we have seen the types of inherent formats SAS supplies. We have also seen a simple example of how a format can be applied using a very basic SAS procedure. Of course, the use of formats is not limited to procedures; they can also be used within the data step or in SQL.

The simple example above illustrated the process of viewing data in a different form than it is stored. In this case, we converted a SAS date value to a 9-character month value for display purposes only. We also applied the format on an as-needed basis, rather than applying it permanently to the column.

Transforming (recoding) Data

Having seen how a format can be used to view data in a different form than it is stored, the same principle can be used to actually transform data. Rather than using predetermined inherent SAS formats for this, an example of user-defined formats will be used.

Suppose the column classification in the customer_data table is to be changed into something more meaningful (maybe for analysis purposes). The data is currently stored at too granular a level, and it needs to be grouped into more convenient levels. Obviously, SAS itself will not have a convenient format to perform this change. There is, however, a SAS procedure that can be used to create a format. Not surprisingly, this procedure is called PROC FORMAT.

There are many different options available for specifying which values are included in a particular formatted group, and many shortcuts. Although they are beyond the scope of this paper, it is definitely worth reading the online documentation to understand these options [i]. The following code will create a format that can then be used in the same way a SAS-supplied format can be applied.

    proc format library=work;            /* (1) */
      value $c_type                      /* (2) */
        'XA', 'XB'  = 'Books'            /* (3) */
        'XC' - 'XZ' = 'Magazines'        /* (4) */
        'YA' - 'YZ' = 'E books';
    run;

When run, this code will create a format that will be stored in the work library (line 1). This means, as with any other object stored in the work library, that it will only be available for the length of the SAS session. Once the session has ended, the format cannot be used without creating it again. The format is actually stored in a SAS catalog within the work library. This catalog is always named formats and will automatically be created by SAS. If the format is needed across SAS sessions, then it should be permanently stored. This is done by specifying a library that will be available in subsequent SAS sessions (e.g. library=mylib).

The name of the format (see line 2) is c_type and it is a character format. It is very important to understand the distinction between character and numeric formats. Character formats can only be used on character data. Their names are always preceded by a $ when they are used, and the format name is limited to seven characters (eight including the $). Numeric formats can only be used on numeric data, and the format name can be up to eight characters. (These limits apply to the SAS release this paper was written for; SAS 9 and later allow 31 and 32 characters respectively.) Note that if a numeric format were being created, the only difference would be that the $ on the value statement would be removed. Obviously, the values that make up the formatted groupings would also have to be numeric in nature.

The values that make up the format groupings (see line 3) can then be listed.
In line 4, an optional technique is to use the - syntax, which means that any value in the range from XC to XZ will be included within the Magazines group.

The format is now created. It can now be used to transform the classification column. In the example below, the format is being used to create a new column (gen_class).

    data new_cust;
      set customer_data;
      length gen_class $9;
      gen_class = put(classification, $c_type.);
    run;

This is a very simple example of using the put() function to recode a column. It would, of course, have been possible to perform the same actions with if-then-else SAS language constructs, but this would be both lengthy to type and difficult to maintain. A warning about the put() function is that it always returns a character string. This might not be what is required, so be careful in its use [ii].

One of the benefits of using a format in the example outlined above is that it reduces the amount of code that needs to be maintained. Instead, only the format needs to be maintained. This can be done in a couple of ways:

- Update the Proc Format code that created the format and re-run it.
- Store the format information in a data table, update the table as needed, then recreate the format.

If there is a choice between these two methods, the second is preferable. To use it, however, a further feature of Proc Format is required: the cntlout and cntlin options. The cntlin option translates data within a SAS file directly into a format. The cntlout option does the opposite: it takes a SAS format and translates it into a SAS data file. The following code will take the format $c_type and store it as a SAS table.

    proc format library=work cntlout=c_type;
      select $c_type;
    run;

The cntlout option specifies the name of the SAS table where the format information will be stored. In this case, a temporary SAS table called c_type will be created. The select statement specifies the format(s) to be extracted. Note that if more than one format is specified in the select statement, then all of the information will be placed into the single cntlout= SAS file. If the select statement is not included, then all the formats contained within the catalog specified by the library= option will be extracted to the output SAS file.

Note that the $c_type format will still exist. Using the cntlout option will not remove the format.

Once the format information is extracted, it can be updated just like any other SAS table. It is important to remember, however, that the structure of the table must be kept intact if the format is to be recreated from the table. The c_type table will have many columns (run a Proc Contents against the output file to see a full list), but the following are some of the key ones:

Column   Description
start    Starting value for the format grouping
end      Ending value for the format grouping
fmtname  Name of the format
type     Type of format (character or numeric)
label    Format value label

These have been selected as the key columns because they are the columns minimally required to perform the opposite function (create a format from a table). Specifically, end is not really needed unless a range is present.
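To make the minimal column list above concrete, the following is a small sketch of a control table built by hand and then turned into a format. It is an illustration under stated assumptions rather than code from the paper: the table name ctrl, the format name $region and its codes are all hypothetical. The end column is omitted because no ranges are involved.

```sas
/* Sketch only: CTRL, $REGION and the codes are hypothetical. */
data ctrl;
  length fmtname $8 type $1 start $2 label $10;
  /* Every row carries the same format name and type.   */
  /* type 'C' marks a character format.                 */
  retain fmtname 'region' type 'C';
  input start $ label $;
  datalines;
NE Northeast
SW Southwest
;
run;

/* Read the control table and create the format $REGION in WORK. */
proc format library=work cntlin=ctrl;
run;
```

Once created, the format would be used like any other, for example put(code, $region.) on a character column holding the two-letter codes.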
In our example, using the $c_type format, to add another format grouping we would edit the c_type SAS table, adding a new row corresponding to the new format group, and then submit the following code:

    proc format library=work cntlin=c_type;
    run;

This will write over the existing $c_type format, incorporating any changes made within the data table.

This method of updating or creating formats is incredibly useful, since it allows the SAS user to use existing data to create formats without having to write multiple lines of code. It also means that formats can be easily maintained, since they can be rebuilt from any changes made to data files. For example, if a format was created to collapse general ledger accounts into summary accounts, then any time a new account was created, the file used to classify the accounts could be used as the input to a Proc Format with the cntlin option.

In the illustrations above, user-defined formats have been used to re-classify data. The cardinality of a column within a table has been changed based upon the format groupings defined within the user-created SAS format. In the example above, a new column was created based upon the formatted values of the classification column. Be careful when transforming a column based upon a format, because the transformation is one way. It is always safer to create a new column so that the original data remains intact. There is no concept of de-formatting!

An Alternative to Look-up Tables

Another very powerful use of formats is as an alternative to look-up tables. Using our existing customer_data file, suppose we want to include some basic demographic data based upon the customer_num and zip. There could be two look-up tables, one for the customer demographic information, and one for the zip code information. The example we have is to compare the customer's income with the median income within their zip code. This could be done with the following code:

    proc sql;
      create table cust_demographic as
      select cust.*,
             zip.median_income,
             c_dem.median
      from customer_data cust,
           zip_demographic zip,
           customer_demog c_dem
      where cust.customer_num = zip.customer_id
        and cust.customer_num = c_dem.customer_id;
    quit;

This could, depending on the size of the files, become very resource intensive. An alternative would be to create formats. In this case, there would be two formats created as follows:

- A format from the zip_demographic file with the customer_id as the start value and median_income as the label. The format will be called $zip_inc.
- A format from the customer_demog file with the customer_id as the start value and median as the label. The format will be called $c_inc.

The following code would perform the same task as the SQL above:

    proc sql;
      create table cust_demographic as
      select cust.*,
             put(customer_num, $zip_inc.) as median_income,
             put(customer_num, $c_inc.) as median
      from customer_data cust;
    quit;

The need to actually perform joins is removed completely. Of course, as with everything, one does not get something for nothing. The formats have to be created, which takes up resources. The creation of the two new columns median and median_income will also take up resources. Note also that, because put() returns a character string, both new columns will be character rather than numeric. There is no hard set of rules as to whether a lookup table or a format is more efficient. A great deal depends on the following:

- The computing resources available. SAS formats are loaded into memory, so very large formats might not be a good idea.
- How often the lookup will be performed. The more often, the more sense it makes to create a permanent format.
- How static the format will be. If the format groupings change very often, it might be more trouble than it is worth to keep a permanent format.
- Whether many attributes are required for a single key value. The example above illustrates the use of two distinct formats. What if five, or ten, or fifteen attributes were needed? Would using formats then make sense? [iii]

There are instances, however, where using formats as lookups can be surprisingly efficient. It is a technique that should be looked into, even though it is proprietary to SAS.

Note that there are potential gains in efficiency from creating format groupings in a specific order. If you know that certain values occur in your data with high frequency, then there is a benefit to storing the applicable format groups toward the start of the format. Since SAS stores the format groups in sorted order by default, the notsorted option on the value statement is needed to ensure that the format values are stored in the order in which they have been placed. Be wary of this option, however, since it will affect the output behavior of several SAS procedures when output is to be sorted by the formatted value of a column, rather than its actual value.

Data Validation

One of the key problems that every SAS programmer comes across is the validity of data. The temptation is to go full speed ahead on the analysis and damn the torpedoes. This is a dangerous and counter-productive tendency, since few of us work in an environment where we can make assumptions about the validity of data. In fact, the only assumption we can safely make is that somewhere in the data there are problems lurking that will surface only after we hand in the final report.

Problems with data necessitate an approach that will help us both get to know the data (still essential, despite all the modern technological tools available) and uncover any problems. Using formats is one method that can help in uncovering unexpected data. The restriction with formats is that they must be based upon some a priori knowledge of that data [iv].

One form of data validation is finite validation. This can be used when there is a known number of possibilities for a value in a given column. In customer_data, the column gender can only have two values: 0 or 1. This is a very simple validation that is easy to perform with a format. The format could be created as follows:

    proc format library=work;
      value $chk_gen
        '0'   = 'female'
        '1'   = 'male'
        other = 'error';
    run;

With this simple format, the data can then be checked. An example of this is illustrated in the following snippet of code:

    data check(drop=chk_gen);
      set customer_data;
      length chk_gen $6;
      chk_gen = put(gender, $chk_gen.);
      if chk_gen = 'error' then do;
        put 'gender error id: ' customer_num;
        output;
      end;
    run;

Any customer that has an invalid gender will be listed in the log, and the entire row from the customer_data file will be output to the check file. This is a very simple example of using a format to test for the validity of data, but one that can be extended to become very complex.

An additional trick in using formats to validate data is to use nested formats [v]. Suppose that we want to check for very high or very low values in the value column. We might, for example, be suspicious of any value at or below $100, or above $100,000.
We want to see these rows explicitly, but otherwise want the value printed with the $ sign and commas. We could create the following format:

    proc format library=mylib;           /* (1) */
      value chk_val                      /* (2) */
        low - 100       = 'too low'      /* (3) */
        100 <- 100000   = [dollar16.]    /* (4) */
        100000 <- high  = 'too high';
    run;

There are a few differences from the previous examples we have used:

- First of all, we are creating a permanent format (see line 1). We are storing the format in a catalog called formats that is referenced by the mylib libname.
- Secondly, we are creating, for the first time, a numeric format, i.e. a format that will be used on data stored within a SAS numeric field or on numeric data itself (see line 2, which has a value statement without a $).
- We are using the low keyword in the range. In this case (line 3) this means that any value below and including 100 will be included in this grouping.
- In line 4, any value that is between 100 and 100000 (excluding the value 100) will have another format applied. In this case, it is the SAS-supplied dollar16. format, which writes out the numeric value with a dollar sign and commas (and with decimal places when a d value is specified).
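As a hedged illustration of how the nested format just described might be applied, the sketch below prints the value column through chk_val. It assumes that the mylib libname has already been assigned and that the format was created as shown. The fmtsearch system option is not mentioned in the paper; it is included here because SAS searches only the WORK and LIBRARY catalogs for formats by default, so a format stored in mylib would otherwise not be found.

```sas
/* Assumes libname MYLIB is assigned and CHK_VAL is stored in MYLIB.FORMATS. */
options fmtsearch=(mylib);   /* add MYLIB to the format search path */

proc print data=customer_data noobs;
  var customer_num value;
  /* Suspicious values print as 'too low' or 'too high';  */
  /* all other values are passed on to DOLLAR16.          */
  format value chk_val.;
run;
```

Because the format is applied only within the procedure, the stored numbers in the value column are left unchanged.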

Line 4 really contains the clever piece of this code. The ability to nest formats can save the programmer a vast amount of time, both in the validation of data and in any other way that the format might be used. The syntax of the nested format is important, however: the square brackets are essential.

Sundry Format Topics

Picture Statement

One aspect of formats that has not been discussed so far is the picture statement. This enables the SAS programmer to use a format as a mask. The picture statement will allow, for instance, telephone numbers or social security numbers stored without dashes to be displayed with dashes. Note that a picture can only be used when the data is numeric. The picture statement can best be described by example, using the phone column in the customer_data table [vi].

    proc format library=work;
      picture phonenum
        low-high = '(999)999-9999' (prefix='(');
    run;

Now, whenever the picture is applied to the phone column, the data will be viewed in the form:

    (555)555-5555

Conclusions

- Know your data! It is important to really know your data when creating a format. It is easy to mistreat data when applying formats without even knowing you are doing so.
- When dealing with very large data, make sure that you test to see when a format might or might not be efficient. There are no simple rules. Using formats should be one weapon in the arsenal.
- Think about permanently storing formats if at all possible. There is an overhead to their creation.
- If the underlying format groups change continuously, think about creating formats directly from tables (using the cntlin option) rather than through code.
- Know your data!

Notes

[i] Some great examples of this can be found in Pete Lund's SUGI 25 paper, entitled: More than Just Value: A Look Into the Depths of PROC FORMAT.
[ii] See Jack Shoemaker's SUGI 25 paper entitled: Eight PROC FORMAT Gems. Specifically look at Jack's tip number 4.
[iii] For a discussion on multiple attributes, see Jack Shoemaker's SUGI 25 paper: Eight PROC FORMAT Gems. Specifically look at Jack's tip number 8.
[iv] For a discussion on the validation of data, see: Peter R. Welbrock, Strategic Data Warehousing Principles Using SAS Software, Cary, NC: SAS Institute Inc., 1998, 384pp. See Chapter 6.
[v] This trick was inspired by Jack Shoemaker's tip number 7, in his SUGI 25 paper: Eight PROC FORMAT Gems.
[vi] The picture statement is explained in detail in Pete Lund's more advanced paper from SUGI 25: More than Just Value: A Look Into the Depths of PROC FORMAT.

Contact Information

Please feel free to contact the author with comments/suggestions/abuse:

Peter Welbrock
Britannia Consulting, Inc
RR2 #1132
Vineyard Haven, MA 02568
Welbrock@erols.com

Trademark Information

SAS is a registered trademark of SAS Institute, Inc., Cary, NC, USA