Greenspace: A Macro to Improve a SAS Data Set Footprint

Similar documents
Same Data Different Attributes: Cloning Issues with Data Sets Brian Varney, Experis Business Analytics, Portage, MI

Transitioning from Batch and Interactive SAS to SAS Enterprise Guide Brian Varney, Experis Business Analytics, Portage, MI

ABSTRACT INTRODUCTION MACRO. Paper RF

Data Manipulation with SQL Mara Werner, HHS/OIG, Chicago, IL

Cleaning Duplicate Observations on a Chessboard of Missing Values Mayrita Vitvitska, ClinOps, LLC, San Francisco, CA

Using SAS software to shrink the data in your applications

Taming a Spreadsheet Importation Monster

Ten tips for efficient SAS code

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

Paper S Data Presentation 101: An Analyst s Perspective

SAS Macro Dynamics - From Simple Basics to Powerful Invocations Rick Andrews, Office of the Actuary, CMS, Baltimore, MD

Validation Summary using SYSINFO

Benchmark Macro %COMPARE Sreekanth Reddy Middela, MaxisIT Inc., Edison, NJ Venkata Sekhar Bhamidipati, Merck & Co., Inc.

Tales from the Help Desk 6: Solutions to Common SAS Tasks

Submitting SAS Code On The Side

Journey to the center of the earth Deep understanding of SAS language processing mechanism Di Chen, SAS Beijing R&D, Beijing, China

The Dataset Diet How to transform short and fat into long and thin

How to Create Data-Driven Lists

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang

Tips for Mastering Relational Databases Using SAS/ACCESS

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

Using PROC REPORT to Cross-Tabulate Multiple Response Items Patrick Thornton, SRI International, Menlo Park, CA

SAS Programming Techniques for Manipulating Metadata on the Database Level Chris Speck, PAREXEL International, Durham, NC

System Requirements. SAS Profitability Management 2.1. Server Requirements. Server Hardware Requirements

Extending the Scope of Custom Transformations

Automating Preliminary Data Cleaning in SAS

Why choose between SAS Data Step and PROC SQL when you can have both?

PharmaSUG 2018 Paper AD-08 Parallel Processing Your Way to Faster Software and a Big Fat Bonus: Demonstrations in Base SAS. Troy Martin Hughes

Using GSUBMIT command to customize the interface in SAS Xin Wang, Fountain Medical Technology Co., ltd, Nanjing, China

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

OLAP Drill-through Table Considerations

Clinical Data Visualization using TIBCO Spotfire and SAS

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

A Better Perspective of SASHELP Views

Programming & Manipulation. Danger: MERGE Ahead! Warning: BY Variable with Multiple Lengths! Bob Virgile Robert Virgile Associates, Inc.

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

BI-09 Using Enterprise Guide Effectively Tom Miron, Systems Seminar Consultants, Madison, WI

BreakOnWord: A Macro for Partitioning Long Text Strings at Natural Breaks Richard Addy, Rho, Chapel Hill, NC Charity Quick, Rho, Chapel Hill, NC

Mapping Clinical Data to a Standard Structure: A Table Driven Approach

SAS IT Resource Management Forecasting. Setup Specification Document. A SAS White Paper

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

Uncommon Techniques for Common Variables

SAS Scalable Performance Data Server 4.3

Paper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA

PREREQUISITES FOR EXAMPLES

ABSTRACT INTRODUCTION TRICK 1: CHOOSE THE BEST METHOD TO CREATE MACRO VARIABLES

ABSTRACT INTRODUCTION WORK FLOW AND PROGRAM SETUP

To conceptualize the process, the table below shows the highly correlated covariates in descending order of their R statistic.

Advanced SQL Processing Prepared by Destiny Corporation

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

Dictionary.coumns is your friend while appending or moving data

Automated Checking Of Multiple Files Kathyayini Tappeta, Percept Pharma Services, Bridgewater, NJ

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

Posters. Workarounds for SASWare Ballot Items Jack Hamilton, First Health, West Sacramento, California USA. Paper

CMISS the SAS Function You May Have Been MISSING Mira Shapiro, Analytic Designers LLC, Bethesda, MD

Omitting Records with Invalid Default Values

Checking for Duplicates Wendi L. Wright

SAS Macros for Grouping Count and Its Application to Enhance Your Reports

Macros for Two-Sample Hypothesis Tests Jinson J. Erinjeri, D.K. Shifflet and Associates Ltd., McLean, VA

Optimizing System Performance

Macro Quoting: Which Function Should We Use? Pengfei Guo, MSD R&D (China) Co., Ltd., Shanghai, China

Going Under the Hood: How Does the Macro Processor Really Work?

DATA Step in SAS Viya : Essential New Features

What Do You Mean My CSV Doesn t Match My SAS Dataset?

Know What You Are Missing: How to Catalogue and Manage Missing Pieces of Historical Data

Efficient Processing of Long Lists of Variable Names

Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

Using a Control Dataset to Manage Production Compiled Macro Library Curtis E. Reid, Bureau of Labor Statistics, Washington, DC

A Macro to Keep Titles and Footnotes in One Place

Give me EVERYTHING! A macro to combine the CONTENTS procedure output and formats. Lynn Mullins, PPD, Cincinnati, Ohio

Using Templates Created by the SAS/STAT Procedures

A Practical Introduction to SAS Data Integration Studio

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

A Breeze through SAS options to Enter a Zero-filled row Kajal Tahiliani, ICON Clinical Research, Warrington, PA

PharmaSUG Paper TT11

Regaining Some Control Over ODS RTF Pagination When Using Proc Report Gary E. Moore, Moore Computing Services, Inc., Little Rock, Arkansas

Green Eggs And SAS. Presented To The Edmonton SAS User Group October 24, 2017 By John Fleming. SAS is a registered trademark of The SAS Institute

The SERVER Procedure. Introduction. Syntax CHAPTER 8

Parallel Data Preparation with the DS2 Programming Language

David Ghan SAS Education

So, Your Data are in Excel! Ed Heaton, Westat

PhUSE Paper CC07. Slim Down Your Data. Mickael Borne, 4Clinics, Montpellier, France

SAS 9 Programming Enhancements Marje Fecht, Prowerk Consulting Ltd Mississauga, Ontario, Canada

SAS Visual Analytics Environment Stood Up? Check! Data Automatically Loaded and Refreshed? Not Quite

Better Metadata Through SAS II: %SYSFUNC, PROC DATASETS, and Dictionary Tables

A Few Quick and Efficient Ways to Compare Data

100 THE NUANCES OF COMBINING MULTIPLE HOSPITAL DATA

Advanced Visualization using TIBCO Spotfire and SAS

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

Leveraging SAS Visualization Technologies to Increase the Global Competency of the U.S. Workforce

A Dynamic Imagemap Generator Carol Martell, Highway Safety Research Center, Chapel Hill, NC

Text Generational Data Sets (Text GDS)

Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL

Ranking Between the Lines

WHAT ARE SASHELP VIEWS?

Big Data? Faster Cube Builds? PROC OLAP Can Do It

Open Problem for SUAVe User Group Meeting, November 26, 2013 (UVic)

CC13 An Automatic Process to Compare Files. Simon Lin, Merck & Co., Inc., Rahway, NJ Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ

Transcription:

Paper AD-150 Greenspace: A Macro to Improve a SAS Data Set Footprint Brian Varney, Experis Business Intelligence and Analytics Practice ABSTRACT SAS programs can be very I/O intensive. SAS data sets with inappropriate variable attributes can degrade the performance of SAS programs. Using SAS compression offers some relief but does not eliminate the issue of inappropriately defined SAS variables. This paper intends to examine the problems inappropriate SAS variable attributes can cause as well as a macro to tackle the problem of minimizing the footprint of a SAS Data Set. INTRODUCTION This paper intends to examine the problem of working with SAS Data Sets that have inappropriate variable attributes. This paper will also examine a SAS macro to remedy the situation by altering character variable attributes to only use as much storage space as necessary based on the variable values in the data. This paper does not address shortening the length of numeric variables as it could lead to problems in numeric precision. THE PROBLEM The focus on this paper is about character variables that are assigned storage lengths that are larger than needed. For example, there are times that data ends up in a database that have most of the character variables defined as a variable length character with a long storage length. i.e. gender stored as a varchar 500. Even though the database can handle this efficiently, when the file is being processed in SAS, it is treated as a fixed length character variable. As far as storage is concerned, one can turn on the compression option but the savings only occur during reading and/or writing the SAS data set to disk. During data processing, the variables are expanded and treated with their full storage length. SAMPLE DATA SAS DATA SET WITH POORLY DEFINED VARIABLE ATTRIBUTES The data set used for this paper is not very large compared to what is encountered in real projects. This data set has 500,000 records and 10 variables. * option to show diagnostics.; options fullstimer; * turn on data set compression; options compress=yes; * set number of records for sample data set; %let nobs=500000; ** Data set with unnecessarily long character fields.; data with_long_vars; length gender race city county state country firstname lastname nickname favoritecolor $500; do id=1 to &nobs.; output; end; NOTE: The data set WORK.WITH_LONG_VARS has 500000 observations and 11 variables. NOTE: Compressing data set WORK.WITH_LONG_VARS decreased size by 99.44 percent. Compressed is 70 pages; un-compressed would require 12500 pages. 1

NOTE: DATA statement used (Total process time): real time 2.22 seconds user cpu time 1.77 seconds system cpu time 0.10 seconds memory 2113.53k OS Memory 9344.00k Timestamp 08/05/2015 08:48:50 AM Step Count 1 Switch Count 0 SAMPLE DATA SAS DATA SET WITH MORE APPROPRIATELY DEFINED VARIABLE ATTRIBUTES data with_shorter_vars; length gender $7 race $40 city $100 county $100 state $100 country $100 firstname $100 lastname $100 nickname $100 favoritecolor $40; do id=1 to &nobs.; output; end; NOTE: Compressing data set WORK.WITH_SHORTER_VARS decreased size by 97.51 percent. Compressed is 154 pages; un-compressed would require 6173 pages. NOTE: DATA statement used (Total process time): real time 0.41 seconds user cpu time 0.37 seconds system cpu time 0.03 seconds memory 580.68k OS Memory 8932.00k Timestamp 08/05/2015 08:49:12 AM Step Count 4 Switch Count 0 Notice in the above two examples that the data set with the more appropriately defined attributes is much more efficient. Even though compression is turned on, it is much less efficient to have character variables with storage lengths that are longer than necessary. The data set above is only 500,000 records and 10 variables. In practice data with millions of rows and hundreds of variables is very common. Metric Long Variables Short Variables File Size 13.9MB 9.7MB Real Time 2.22 seconds 0.41 seconds THE SOLUTION There are a few solutions to this problem of poorly defined SAS data set variable attributes. 1) Get the attributes changed in the source. This is the best case in that it is easiest for you. However, the entity providing you the data may not be willing, able, or have the time to make the changes. 2) Build a SAS view between the data source and your SAS data set environment. 3) Use the greenspace macro to minimize the footprint of your SAS data set(s) 2

BUILD A SAS VIEW AGAINST THE DATA SOURCE One method to reduce the impact on processing is to build a SAS View in between the data source and your target data library. A view is a set of instructions on how to obtain the data. Every time you use the view, the data is pulled from the source. proc sql; create view wlv_view as select id, gender length=7, race length=40, city length=100, county length=100, state length=2, country length=100, firstname length=100, lastname length=100, nickname length=100, favoritecolor length=40 from with_long_vars; quit; THE GREENSPACE MACRO In the interest of minimizing the footprint for a data set for your project data library, it would be best to have an automated process (macro) to calculate the minimal length necessary for storing the data in each of your character variables. The following is a high level overview of the steps that the greenspace macro carries out. 1) Reads the metadata for the input SAS data set and stores the variable names, types, lengths, formats, and labels into sequential macro variables. 2) Uses PROC SQL to read through the input SAS data set and store the maximum character value for each character variable. If the variable is numeric, it just uses the original length. 3) A SAS data step is used to write out the data set to the output data set designated in the macro parameter. The lengths from step 2 are used in a length statement prior to the set statement to force the new more efficient storage length. The format is also adjusted to match the new storage length for the character variables. 4) The informats are wiped out since they no longer serve a purpose. 5) PROC CONTENTS are run on the original input and new output SAS data sets for review. The full documented code follows. /*--------------------------------------------------------- Program: greenspace.sas Programmer: Brian Varney Date: Purpose: Shorten the storage length of character variables to the longest value in that variable within the input data set. Parameters: DSIN: The input data set name. this can be a one or two level name. DSOUT: The output data set name. This can be a one or two level name. Cautions: 3

I would not recommend making the output data set the same as the input data set. If something did not go as planned, you could lose the input data set. If a character variable has a user defined format applied, it will be over written by a more generic format. If there are indexes on the data set they will be removed when the data set is re-written. ----------------------------------------------------------*/ %macro greenspace(dsin=,dsout=); ** Keep as many macro variables local to the macro as **; ** possible. **; %local libn memn numvars i; ** Split the input data set name into the two levels if **; ** appropriate. **; %if %index(&dsin.,.)=0 %then %let libn=work; %let memn=%upcase(&dsin.); %else %let libn=%upcase(%scan(&dsin.,1,.)); %let memn=%upcase(%scan(&dsin.,2,.)); %put &libn. &memn.; ** Grab all of the variable information and put into **; ** macro variables. **; proc sql noprint; select name, type, length, format, label into :name1-:name99999, :type1-:type99999, :length1-:length99999, :format1-:format9999, :label1-:label9999 from dictionary.columns where libname="%upcase(&libn.)" and memname="%upcase(&memn.)" ; quit; %let numvars=&sqlobs.; ** Calculate the maximum character value length for each **; ** variable and grab the current storage length for each **; ** numeric variable. **; proc sql; create table &memn._maxlengths as select 4

%do i=1 %to &numvars.; max(length(&&name&i.)) as &&name&i. %else &&length&i. as &&name&i. %if &i ne &numvars. %then, from &dsin. ;quit; ** Transpose the data so we can load the desired **; ** variable storage lengths into macro variables. **; proc transpose data=&memn._maxlengths out=&memn._maxlengthsv(rename=(col1=maxlength)) name=varname; ** Load the desired variable lengths into macro **; ** variables. **; proc sql noprint; select maxlength into :maxlength1-:maxlength99999 from &memn._maxlengthsv; quit; ** Apply the new character variable storage lengths to **; ** the output data set. Also remove the informats from **; ** the character variables as they would not longer make **; ** sense with the storage length. If there was a user **; ** defined format applied to the variable, it will be **; ** updated with a generic character variable format. **; data &dsout.; length %do i=1 %to &numvars.; &&name&i. $&&maxlength&i. %else &&maxlength&i. ; set &dsin.; format %do i=1 %to &numvars.; 5

&&name&i. $&&maxlength&i... ; informat %do i=1 %to &numvars.; &&name&i. ; ** Examine the input and output data set attributes to **; ** ensure you are getting what you are expecting. **; proc contents data=&dsin.; title1 "Contents of &dsin. Before Greenspace"; proc contents data=&dsout.; title1 "Contents of &dsin. After Greenspace"; ** Reset the titles and footnotes. **; title1; footnote1; %mend greenspace; CONCLUSION Knowing your data attributes is important in addition to knowing your data content. By storing SAS data with appropriate variable attributes, poor performance and confusion can be avoided. There are times when this method could cause chaos. For example, if you need your variable attributes to be consistent as new data is added or removed, this approach may not be the best. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Brian Varney: Experis 5220 Lovers Lane, Suite 200 Portage, Michigan 49002 Office: (269) 553-5185 Fax: (269) 553-5101 E-mail: brian.varney@experis.com Web: www.experis.us/analytics SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 6