
Efficient Processing of Long Lists of Variable Names

Paulette W. Staum, Paul Waldron Consulting, West Nyack, NY

ABSTRACT

Many programmers use the SAS macro language to manipulate lists of variable names. They add prefixes or suffixes, insert delimiters, sort lists, identify matching words, etc. When these tools are used for lists with hundreds of entries, they can be surprisingly slow, taking minutes for simple string manipulation. Other methods of implementing list manipulations can significantly speed the processing of long lists. This paper will use DATA step techniques (including hash tables) and PROC SQL to dramatically reduce the time required for processing long lists.

INTRODUCTION

If generic programs are used for many different input data sets, they usually need to process lists of variables. These lists can come from input data sets, or they can be specified by the people who use the programs. Much of the processing of these lists is standard: adding a prefix or suffix to each variable name, inserting delimiters between variable names, sorting the list of variables, etc. These tasks are often handled by utility macros. Usually the lists are stored in macro variables and are manipulated by macro code. Since macro code generally runs in memory without disk access, this seems likely to be efficient.

If your data sets have many observations and variables, you expect that most of the processing time will be spent on data handling. However, if your data sets contain hundreds of variables, the processing of variable lists can be the slowest part of the programs. There are non-macro methods for processing these lists. Some of them are:

  - PROC SORT
  - PROC SQL
  - DATA step with the SCAN function
  - DATA step with a hash object

We will consider two examples of standard list processing tasks:

  - Identifying duplicate items in a list
  - Sorting a list

Simple macro language methods for this type of task require comparing each pair of items in a list. The execution times for these examples increase dramatically when there are large numbers of items in the lists. The time for simpler tasks, such as adding prefixes or inserting delimiters, does not grow as sharply with list length.

BASIC TOOLS

Some lists are naturally created in data sets. Others are naturally stored as macro variables. Macros can convert lists between the two forms. These macros will:

  - Convert lists from data sets to macro variables
  - Convert lists from macro variables to data sets
  - Count the number of items in a macro variable list or data set
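Before turning to those building blocks, here is a minimal sketch, not taken from this paper, of the kind of simple macro-loop list utility described in the introduction. It adds a prefix to each name in a space-delimited list; the macro name and parameters are illustrative, and it assumes the COUNTW function (available in recent SAS releases).

%macro AddPrefix(InStr=, Prefix=);
   %* Illustrative utility (ours): prefix each word in a space-delimited list;
   %local i out;
   %do i = 1 %to %sysfunc(countw(&InStr, %str( )));
      %let out = &out &Prefix.%scan(&InStr, &i, %str( ));
   %end;
   &out
%mend AddPrefix;

%put Prefixed = %AddPrefix(InStr=name sex age height, Prefix=old_);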

Note that these sample macros are examples, not production-quality macros. Production macros would allow the user more control over the returned macro variable and the list delimiter. They would check for errors, use more macro quoting functions, clean up any environmental changes, provide more complete documentation, etc.

CONVERTING DATA SET LISTS TO MACRO VARIABLE LISTS

The first macro converts a data set list to a macro variable list. It uses PROC SQL because it is faster than using a DATA step with CALL SYMPUT.

%macro GetList(InDS=, Var=, OutStr=);
   %* InDS is the input data set. Var is the variable holding the item.
      OutStr is the macro variable name.;
   %* Create the output macro variable if needed;
   %if not %symexist(&OutStr) %then %global &OutStr;

   proc sql noprint;
      select &Var into :&OutStr separated by ' '
      from &InDS;
   quit;
%mend GetList;

The output macro variable may already exist as a local variable in a calling macro environment. If the output macro variable does not exist, it is declared as a global macro variable to make sure that it will be available outside of the GetList macro.

%getlist(inds=sashelp.class, var=name, OutStr=FromGetList);
%put FromGetList = &FromGetList;

The next macro is a special case of the GetList macro. It extracts a list of the variables in a data set from a PROC SQL dictionary table.

%macro GetVars(InDS=, OutStr=);
   %if not %symexist(&OutStr) %then %global &OutStr;
   %let InDS = %upcase(&InDS);
   %if %index(&InDS,.) = 0 %then %let InDS = work.&InDS;

   proc sql noprint;
      select name into :&OutStr separated by ' '
      from dictionary.columns
      where libname = "%scan(&InDS,1,.)"
        and memname = "%scan(&InDS,2,.)";
   quit;
%mend GetVars;

The libname and memname are automatically returned from the dictionary table in upper case, so it is important to upper-case the input parameter for the data set name.

%getvars(inds=sashelp.class, outstr=fromgetvars);
%put FromGetVars = &FromGetVars;
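For comparison, the slower DATA step alternative to GetList mentioned above could be sketched as follows. This is our illustration rather than code from the paper; the data set, variable, and macro variable names are arbitrary, and CALL SYMPUTX is used in place of CALL SYMPUT to trim trailing blanks.

data _null_;
   length list $32767;
   retain list;
   set sashelp.class end=last;
   * Append each name to a growing space-delimited list;
   list = catx(' ', list, name);
   * Write the finished list to a global macro variable;
   if last then call symputx('FromDataStep', list, 'G');
run;

%put FromDataStep = &FromDataStep;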

COUNTING ITEMS IN A MACRO VARIABLE LIST

Counting items in a macro variable list is a basic function that can be done in many different ways. Here is one example. This macro can be used like a function call, returning the number of words or items.

%macro NWords(InStr=);
   %local textstr count;
   %let textstr = %cmpres(%nrbquote(&InStr));
   %* Count the blanks between words, then add 1;
   %let count = %sysfunc(countc(%nrbquote(&textstr), %str( )));
   %eval(&count + 1)
%mend NWords;

%put NItems = %NWords(InStr=abc xy_z);

CONVERTING MACRO VARIABLE LISTS TO DATA SET LISTS

Another macro converts a macro variable list to a data set list.

%macro LoadWords(InStr=, OutDS=work.words, Var=word, MaxLen=200);
   %* InStr is a macro variable with a list. OutDS is the output data set.
      Var is the name of the created variable. MaxLen is the maximum length of a word.;
   %local i NWords;
   %let NWords = %NWords(InStr=&InStr);

   data &OutDS;
      length &Var $&MaxLen;
      %do i = 1 %to &NWords;
         &Var = "%scan(&InStr,&i)";
         output;
      %end;
   run;
%mend LoadWords;

This approach requires making an assumption about the maximum length of an item in the list. The MaxLen parameter allows control over this assumption. If the length of the quoted string exceeds 200 characters, there will be a message in the log about possible missing closing quotes. This message can be suppressed by setting the option NOQUOTELENMAX.

%LoadWords(InStr=a b c d e f XYZ, OutDS=work.words);

COUNTING ITEMS IN A DATA SET LIST

Obviously, to count the number of items in a data set list, just count the observations in the data set. This is not necessary for the methods described below, but it is included for completeness.

%macro NObs(InDS=);
   %local dsid nobs;
   %let dsid = %sysfunc(open(&InDS));
   %if &dsid %then %do;
      %let nobs = %sysfunc(attrn(&dsid, nlobsf));
      %let dsid = %sysfunc(close(&dsid));
   %end;
   %else %let nobs = -1;
   &nobs
%mend NObs;

Using the NLOBSF attribute with the ATTRN function gets the effective number of observations from the data set.

%put NumObs = %NObs(InDS=sashelp.class);

Now we have some basic building blocks to use when testing different ways of processing lists. Next, let's discuss two standard list processing tasks. The first example will introduce different ways of identifying duplicates in a list of items. The second example will discuss different ways of sorting a list of items.

EXAMPLE 1: IDENTIFYING DUPLICATE ITEMS

One type of list processing macro searches a list of variables for duplicate items. It determines the number of words in a list. It compares each word to every other word. (It checks each pair of words only once, not twice.) It outputs a list of the words which were duplicated. There are several design decisions to be made.

1. Are case differences important or unimportant when comparing words?
2. Should each duplicate be in a separate macro variable, or should all duplicates be in one list? If there are embedded spaces in the list values, separate macro variables will be more useful. In the context of lists of variables, one combined list is likely to be more useful.
3. What order should be used for the output list of duplicates? Should it be the input order or an alphabetic order?
4. If a word appears more than two times in the list, should it be included in the output once or multiple times?

In our examples, 1) the results are case sensitive and 2) a combined macro variable list is produced. It is easy to switch to case insensitivity and to producing many macro variables as output. The 3) order and 4) redundancy of the output are more intrinsically tied to the method used to identify duplicates, as we will see below.

A sample call for this type of macro is:

%DupWords(InStr=Black Berry Blue Berry Cherry Pine Apple Straw Berry Apple, OutStr=dups)
%put Duplicates = &dups;

The InStr parameter is the list of items to check. The OutStr parameter is the name of the output macro variable.

MACRO LOOP METHOD

The essential functionality in a simple macro method consists of two loops. The first loop isolates each item in the list. The second loop compares the items. This structure is efficient because it minimizes the number of calls to the SCAN function.

%macro DupWords(InStr=, OutStr=);
   %local i j NWords;
   %* Create the output macro variable if needed, or clear it;
   %if not %symexist(&OutStr) %then %global &OutStr;
   %else %let &OutStr=;

   %let NWords = %NWords(InStr=&InStr);

   %* First loop: isolate each item in the list;
   %do i = 1 %to &NWords;
      %local word&i;
      %let word&i = %scan(&InStr, &i, %str( ));
   %end;

   %* Second loop: compare each pair of items once;
   %do i = 1 %to &NWords;
      %do j = &i + 1 %to &NWords;
         %* Are two words the same?;
         %if &&word&i = &&word&j %then %do;
            %let &OutStr = &&&OutStr &&word&i;
            %goto found;
         %end;
      %end;
   %found:
   %end;
%mend DupWords;

This macro loop example keeps the output list of duplicates in input order. It also retains redundant duplicates.

There is a maximum size of 64K for each macro variable. It is possible to exceed this limit for a list of variables in a data set with thousands of variables. In that case, methods that do not store the list in a macro variable must be used.

The tests were done on an IBM ThinkPad PC in a Windows XP environment. Comparable results were obtained in a Windows server environment, so the results are not specific to any one environment. All times are wall clock time in seconds. This type of macro runs quickly for up to about 100 items. Above 100 items, the required time increases rapidly. This is a problem when long lists (for example, for data sets with more than 500 variables) are processed.

Identifying Duplicates in a List - Macro Execution

   Number of Words    Time in Seconds
                40                .12
                88                .40
               176               1.41
               220               2.23
               440               8.99
               506              11.58
              1012              47.62

The SAS system option MSYMTABMAX controls the amount of memory available to the macro symbol table. The names and values of macro variables are stored in the macro symbol table. If the table's size exceeds the value of MSYMTABMAX, additional macro variables are stored on disk and macro variable processing becomes very slow. The default setting of MSYMTABMAX is 4,194,304. The test results were similar with MSYMTABMAX set to larger or smaller values (8,388,728 and 2,097,512).

Let's look at other solutions. For some of these methods, you need to convert lists between a macro variable and a data set. The macro %LoadWords (described above) is used to convert a list from a macro variable to a data set. The macro %GetList is used to convert a list from a data set to a macro variable.

PROC SORT METHOD

One alternative is to load the list into a data set and sort the data set using the NODUPKEY and DUPOUT options of PROC SORT. Using NODUPKEY limits the primary output to one observation per key value. (The key value is a word in the list.) The DUPOUT option names a data set which will hold any observations with duplicate words. This PROC SORT example produces the output list in alphabetic order. It also retains redundant duplicates.

%macro DupWords(InStr=, OutStr=);
   %LoadWords(InStr=&InStr);

   proc sort data=work.words dupout=work.dups nodupkey;
      by word;
   run;

   %GetList(InDS=work.dups, Var=word, OutStr=&OutStr);
%mend DupWords;

PROC SQL METHOD

Another alternative is to load the list into a data set and use PROC SQL to extract the duplicate items from the list. This PROC SQL example produces the output list in alphabetic order and eliminates redundant duplicates.

%macro DupWords(InStr=, OutStr=);
   %LoadWords(InStr=&InStr);

   proc sql noprint;
      create table work.dups(drop=num) as
      select word, count(*) as num
      from work.words
      group by word
      having num > 1;
   quit;

   %GetList(InDS=work.dups, Var=word, OutStr=&OutStr);
%mend DupWords;

DATA STEP SCAN METHOD

Another alternative is to use a DATA step with the SCAN function to identify duplicates. The logic is identical to the macro loop method logic.

%macro DupWords(InStr=, OutStr=, MaxLen=200);
   %local NWords;
   %* Create the output macro variable if needed, or clear it;
   %if not %symexist(&OutStr) %then %global &OutStr;
   %else %let &OutStr=;
   %let NWords = %NWords(InStr=&InStr);

   data work.dups(keep=word);
      length word iword1-iword&NWords $&MaxLen;
      array words {*} iword1-iword&NWords;
      * Load each item into an array element;
      do i = 1 to &NWords;
         words{i} = scan("&InStr", i);
      end;
      * Compare each pair of items once;
      do i = 1 to &NWords;
         do j = i + 1 to &NWords;
            if words{i} = words{j} then do;
               word = words{i};
               output;
               goto found;
            end;
         end;
         found: ;
      end;
      stop;
   run;

   %getlist(inds=work.dups, var=word, OutStr=&OutStr);
%mend DupWords;

DATA STEP HASH OBJECT METHOD

You can also use a DATA step with a hash object to identify duplicates. Quoting the online documentation: "The hash object provides an efficient, convenient mechanism for quick data storage and retrieval. The hash object stores and retrieves data based on lookup keys."

%macro DupWords(InStr=, OutStr=, MaxLen=200);
   %local NWords NDups;
   %* Create the output macro variable if needed, or clear it;
   %if not %symexist(&OutStr) %then %global &OutStr;
   %else %let &OutStr=;

   data work.dups(keep=word);
      length word $&MaxLen;
      declare hash words(ordered:'yes');
      rc = words.definekey('word');
      rc = words.definedone();
      NWords = count(trim("&InStr"), ' ') + 1;
      do i = 1 to NWords;
         word = scan("&InStr", i, ' ');
         * A nonzero return code means the key was already in the hash object;
         rc = words.add();
         if rc ^= 0 then output;
      end;
      stop;
   run;

   %getlist(inds=work.dups, var=word, OutStr=&OutStr);
%mend DupWords;

The hash object does not permit duplicate keys, so any attempt to add a duplicate item will cause an error code to be returned. Essentially this method attempts to insert all the list items into a hash object and returns a list of the unsuccessful attempts. This hash table example produces the output list in input order. It also retains redundant duplicates. The hash table results are not significantly affected by varying the hash table size with the HASHEXP option.

COMPARISONS

The results of comparing the different methods of identifying duplicate items in a list are:

Identifying Duplicates in a List (times in seconds)

   Number of Words    Macro   PROC SORT   PROC SQL   DATA Step Scan   DATA Step Hash Object
                40      .12         .04        .03              .03                     .04
                88      .40         .07        .05              .05                     .05
               176     1.41         .12        .11              .11                     .03
               220     2.23         .23        .21              .20                     .03
               440     8.99         .66        .63              .65                     .05
               506    11.58         .68        .65              .67                     .07
              1012    49.05        2.46       2.39             2.37                     .09

All the DATA step / procedure methods are more efficient than the simple macro code method. For short lists, the timing differences are not important. For long lists, the non-macro methods are much faster. The hash object is an excellent choice.
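The design discussion earlier noted that it is easy to switch to case-insensitive comparisons. As one illustration (ours, not from the paper), the hash version only needs its key upcased before it is added, so the DATA step inside %DupWords would become:

   data work.dups(keep=word);
      length word $&MaxLen;
      declare hash words(ordered:'yes');
      rc = words.definekey('word');
      rc = words.definedone();
      NWords = count(trim("&InStr"), ' ') + 1;
      do i = 1 to NWords;
         * Upcase the key so that Berry, BERRY, and berry collide;
         word = upcase(scan("&InStr", i, ' '));
         rc = words.add();
         if rc ^= 0 then output;
      end;
      stop;
   run;

The duplicates returned in the output macro variable are then reported in upper case; preserving the original case would require carrying a second, untouched copy of each word.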

When the output order or the handling of redundant duplicates is critical, the differences in the output of the methods may limit your range of choices. This table summarizes the output characteristics of the different methods.

Comparing Methods of Identifying Duplicates in a List

   Method             Sample Result        Order        Redundancy
   Macro              Berry Berry Apple    Original     Kept
   Sort               Apple Berry Berry    Alphabetic   Kept
   SQL                Apple Berry          Alphabetic   Omitted
   Data Step / Scan   Berry Berry Apple    Original     Kept
   Data Step / Hash   Berry Berry Apple    Original     Kept

EXAMPLE 2: SORTING LISTS

Let's take a quick look at another example of a standard list processing task. Suppose you need to sort a list of items that is stored in a macro variable. You start with an unordered list ("go forth and multiply") and finish with an ordered list ("and forth go multiply"). You can do this with simple macro loops, or you can use PROC SORT or PROC SQL. The hash table method is of limited use for this sorting task because its output will not include duplicate items. Here is the PROC SQL method.

%macro SortWords(InStr=, OutStr=);
   %* Create the output macro variable if needed;
   %if not %symexist(&OutStr) %then %global &OutStr;
   %LoadWords(InStr=&InStr);

   proc sql noprint;
      select word into :&OutStr separated by ' '
      from work.words
      order by word;
   quit;
%mend SortWords;

%SortWords(InStr=go forth and multiply, OutStr=OrderedList);
%put OrderedList = &OrderedList;

The results of comparing different methods for various lengths of lists are:

Sorting a List (times in seconds)

   Number of Words    Macro   PROC SORT   PROC SQL
                88     1.10         .04        .03
               176     6.53         .09        .08
               220     12.3         .14        .11
               440     91.6         .58        .38
               506    136.4        1.07        .48

Again, a simple macro method is significantly slower than using a procedure. The fastest method is to use PROC SQL to create a sorted list.
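For completeness, the PROC SORT approach mentioned above can be sketched with the same building blocks. The macro and output names here are ours, and the data set name follows the %LoadWords default.

%macro SortWordsSort(InStr=, OutStr=);
   %* Sketch (ours): sort a macro variable list via a data set and PROC SORT;
   %LoadWords(InStr=&InStr);

   proc sort data=work.words;
      by word;
   run;

   %GetList(InDS=work.words, Var=word, OutStr=&OutStr);
%mend SortWordsSort;

%SortWordsSort(InStr=go forth and multiply, OutStr=OrderedList2);
%put OrderedList2 = &OrderedList2;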

CONCLUSIONS

Macro code performs complex processing of long lists of items very slowly. If your program requires examining every pair of items in long lists, you can improve performance by replacing macro loops with a DATA step or a PROC SQL step.

REFERENCES

Carpenter, Arthur L., California Occidental Consultants. "Storing and Using a List of Values in a Macro Variable." SUGI 30.
Morris, Robert J., RTI International. "Text Utility Macros for Manipulating Lists of Variable Names." SUGI 30.
Whitlock, Ian. "Macro Bugs: How to Create, Avoid and Destroy Them." SUGI 30.

ACKNOWLEDGMENTS

SAS is a registered trademark of SAS Institute Inc., Cary, North Carolina. Thanks to Randy Poindexter of SAS Technical Support for improving the macro method. Thanks to Tom Hoffman of Hoffman Consulting for the code to count items in a list. Any remaining inefficiencies are my own.

CONTACT INFORMATION

Suggestions, questions, and comments are encouraged. Contact the author at:

   Paulette Staum
   Paul Waldron Consulting, Inc.
   2 Tupper Lane
   West Nyack, NY 10994
   Phone: (845) 358-9251
   Email: staump@optonline.net