Programming Beyond the Basics. Find() the power of Hash - How, Why and When to use the SAS Hash Object John Blackwell

ABSTRACT

The SAS hash object has come of age in SAS 9.2, giving the SAS programmer the ability to quickly do things never before possible within a single data step. This paper demonstrates how easy it is to get started with the powerful features of the hash object and shows examples of how these features can improve programmer productivity and system performance.

INTRODUCTION

The SAS hash object is commonly referred to as an in-memory look-up table. While this is accurate enough, thinking of it as merely a look-up table does not give programmers an intuitive sense of everything a hash object can be used for. This is particularly true in SAS 9.2, where two new features have greatly expanded its usefulness. This paper provides a brief introduction to the SAS hash object, showing a typical look-up usage and then examining situations where hash objects can be the difference between a job failing and completing.

USING THE HASH OBJECT AS LOOK-UP TABLE

In this example I am using a data set from a direct-marketing organization that contains the account_id of the customer and a source code representing various catalog mailings.

Sample Customer_Mail data:

    Account_ID   Source
    8800544      SP45635
    8803002      WT4563
    9330220      SP45635
    5996600      WT4563
    5588221      WT4563
    5996600      SP45635

The second data set will be used as the look-up table for the source code description and loaded into a hash object.

Sample Source_Code data:

    Source    Description
    SP45635   Spring Catalog 2009
    WT4563    Winter Catalog 2009

To append the descriptions from the Source_Code data set to the Customer_Mail data using a hash object, I executed the following code. Where a description is missing from the source_code table, the description is populated with the text 'missing'.

    data xmpl.mailing_w_descriptions;
      length description $35;
      set xmpl.customer_mail;
      if _n_ = 1 then do;
        /* load the look-up table into memory once */
        declare hash sc (dataset: 'xmpl.source_code');
        sc.definekey('source');
        sc.definedata('description');
        sc.definedone();
      end;
      /* FIND returns 0 on a match; the condition is meant to hold for */
      /* every row, so unmatched records are kept                      */
      if sc.find() ge 0;
      if description = '' then description = 'missing';
    run;

The resulting data set would be identical to what would have been achieved by sorting each data set by source and then using the data step to merge the sorted tables (or by using an outer join in proc sql on the source field). The hash object uses significantly less CPU time because the data from xmpl.source_code is loaded into memory, which minimizes the number of costly disk reads.

The above example illustrates the hash object at its simplest; it is a good starting point for the programmer who is not yet familiar with hash objects. The performance benefits grow substantially when you are combining multiple large data sets and, in some cases, can be the difference between a job failing and running successfully. For example, programmers often need more than one look-up table, and adding multiple hash objects is fairly simple. In the following example we add a region code based on the account_id of the customer as well as appending the source description.

    data xmpl.mailing_w_descriptions_region;
      length description region_code $35;
      set xmpl.customer_mail;
      if _n_ = 1 then do;
        declare hash sc (dataset: 'xmpl.source_code');
        sc.definekey('source');
        sc.definedata('description');
        sc.definedone();
        /* second hash object */
        declare hash rg (dataset: 'xmpl.customer_geographic');
        rg.definekey('account_id');
        rg.definedata('region_code');
        rg.definedone();
      end;
      if sc.find() ge 0;
      if rg.find() ge 0;
      if description = '' then description = 'missing';
    run;

COMMON ERRORS

It is important to note that the DEFINEKEY, DEFINEDATA, and DEFINEDONE methods do not create variables in the program data vector (PDV). If you were to execute the above code without the length statement, the log would show the following error messages:

    ERROR: Undeclared data symbol description for hash object
    ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
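The error described above goes away once the data variable exists in the PDV before the hash object is instantiated. The following is a minimal, self-contained sketch of the single look-up case under that assumption. The WORK data set names are assumptions made for illustration (the paper uses the xmpl library), the rows are the sample rows shown earlier, and capturing the FIND return code in rc is a stylistic variation on the paper's subsetting IF rather than the author's own code.

```sas
/* Sample rows from the paper; WORK data set names are illustrative assumptions */
data work.customer_mail;
  length source $7;
  input account_id source $;
  datalines;
8800544 SP45635
8803002 WT4563
9330220 SP45635
5996600 WT4563
5588221 WT4563
5996600 SP45635
;
run;

data work.source_code;
  length source $7 description $35;
  /* the & modifier lets description contain single embedded blanks */
  input source $ description & $;
  datalines;
SP45635 Spring Catalog 2009
WT4563 Winter Catalog 2009
;
run;

data work.mailing_w_descriptions;
  /* description must be in the PDV before DEFINEDONE runs, */
  /* because DEFINEDATA does not create it                  */
  length description $35;
  set work.customer_mail;
  if _n_ = 1 then do;
    declare hash sc (dataset: 'work.source_code');
    sc.definekey('source');
    sc.definedata('description');
    sc.definedone();
  end;
  /* FIND returns 0 when the key is found and a nonzero code otherwise */
  rc = sc.find();
  if rc ne 0 then description = 'missing';
  drop rc;
run;
```

With these sample rows every source code is present in the look-up table, so all six records should receive a catalog description. Removing the length statement from the last step is a quick way to reproduce the "Undeclared data symbol" error.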

Since we are only adding two variables (description and region_code) through the hash objects in the two-table look-up example, including a length statement is a relatively efficient way to add them to the PDV. However, altering the PDV by hand to accommodate many variables of different lengths is not. The most efficient way to alter the PDV is to issue a SET statement for the data set(s) being loaded into the hash object(s), guarded by conditional logic that will never be true. For example, the following lines cause the PDV to include the variables in xmpl.customer_geographic and xmpl.source_code:

    if 1=2 then do;
      set xmpl.customer_geographic;
      set xmpl.source_code;
    end;

Obviously 1 will never equal 2, so the SET statements never actually execute, but the compiler still adds their variables to the PDV. This method renders the length statement unnecessary and, since the variable attributes are taken automatically from the data sets named in the SET statements, the programmer does not even need to know their lengths. Note that variables defined through a SET statement are retained across iterations, so after an unsuccessful FIND the data variables still hold the values from the last match and should be reset (for example with CALL MISSING) before they are tested.

THINKING BEYOND LOOK-UP TABLES

As of SAS 9.2, the hash object has an ordered option which can be used to sort a data set. While sorting is not necessary for look-ups, many SAS procedures that use by-group processing either require, or run faster on, sorted data sets. For example, consider a series of reports created from a multi-million-row data set and generated by several proc tabulate steps. Each proc tabulate step uses different by groups, so the data set has to be sorted several times. The sorts on this data set are extremely costly in terms of CPU time and take several minutes each using proc sort. You can use proc tabulate with the unsorted option, but the run time on large data sets often makes this impractical.
The following code shows how a sorted subset of the data (xmpl.market_sort) can be created using proc sort and through hashing:

Sorting with proc sort

    proc sort data=xmpl.marketing_rev
              out=xmpl.market_sort (keep=account_id contact_date);
      by account_id contact_date;
    run;

Sorting with hash

    data _null_;
      length account_id contact_date 8;
      /* ordered:'a' returns the keys in ascending order */
      declare hash hh (dataset: 'xmpl.marketing_rev', ordered: 'a');
      hh.definekey('account_id', 'contact_date');
      hh.definedone();
      hh.output(dataset: 'xmpl.market_sort');
    run;

Prior to SAS 9.2, the above hash object code would not have worked if duplicate values for account_id and contact_date had needed to be preserved in the resulting data set. The hash object would simply take the first value for each key and ignore duplicates. As of 9.2, you can specify that it take the last value, or preserve all duplicate-key values with the multidata option. The previous example could be rewritten as follows to preserve duplicate values:

Sorting with hash (preserving duplicates with the multidata option)

    data _null_;
      length account_id contact_date 8;
      /* multidata:'y' keeps every record that shares a key */
      declare hash hh (dataset: 'xmpl.marketing_rev', multidata: 'y', ordered: 'a');
      hh.definekey('account_id', 'contact_date');
      hh.definedone();
      hh.output(dataset: 'xmpl.market_sort');
    run;

When using the hash object to sort, it is useful to remember that you can specify as many fields as you want to sort by as the key. The key is not necessarily a combination of values that would normally be considered a key, but rather any combination of values (character, numeric, or both) by which you would like to sort your data. In the above example the CPU time for the hash object sort is 35 seconds versus 7 minutes for the proc sort.

USING HASH TO ACCESS DATA OUTSIDE OF SAS

One of the most compelling reasons to use hash objects involves working with large data sets stored outside of SAS. For example, most marketing organizations use SAS to access data from a data warehouse in an RDBMS. Typically SAS programmers extract data into temporary data sets and then sort and merge them in subsequent steps. But because the data in the RDBMS cannot be sorted or indexed, it is often not possible to use the data step with a merge statement when working with non-SAS data. While proc sql can be used with varying degrees of efficiency, depending on the size of the data, it is often impractical. In the scenario where a programmer wants to combine large SAS data sets with large Oracle tables, the proc sql step will often fail.

The following example combines data from a SAS data set (xmpl.account_names) and an Oracle table (orcl.account_purchases). The SAS data set contains the name information for all accounts, and the Oracle table lists the total purchase amount for all accounts.
    data highest_amt;
      length total_purchase_amount 8;
      set xmpl.account_names;
      if _n_ = 1 then do;
        /* load the Oracle table once, on the first iteration only */
        declare hash a (dataset: 'orcl.account_purchases');
        a.definekey('account_id');
        a.definedata('total_purchase_amount');
        a.definedone();
      end;
      if a.find() ge 0;
    run;

In its simplicity, this is similar to the first example in the paper. However, in this case the proc sql version of the code failed to complete, and the multi-step method of doing an initial extract from the Oracle table, sorting, and then merging using the data step took almost 30 minutes. The preceding code took only about three minutes to run. In addition, because this method does not call on any of the more complex methods associated with hash, the code is quick to write even for programmers new to the SAS hash object. While this is technically a case of using the hash object as a look-up, it is not what people typically associate with look-up tables, since the data being joined through the hash object has the same cardinality as the data set specified in the set statement.

CONCLUSIONS

The SAS hash object should be thought of as more than just an in-memory table look-up. While traditional look-ups (appending information from a small table to a bigger table) can be performed efficiently with the SAS hash object, in many situations the improvement in performance is only a few seconds. For many users, gains of a few seconds or even minutes may not be critical and are often not sufficient motivation to learn how to use the hash object. However, in situations where you are dealing with very large data sets and multi-field keys, hashing may be the only viable solution. SAS programmers should see the hash object as an alternative to any proc sql join or data step merge. The only limit to the amount of data that can be loaded into a hash object is the amount of memory available to the SAS session.

REFERENCES

Bhat, Gajanan, and Raj Suligavi. 2001. "Merging Tables in DATA Step vs. PROC SQL: Convenience and Efficiency Issues." Proceedings of the Twenty-Sixth SAS Users Group International Meeting. Cary, NC: SAS Institute Inc.

Loren, Judy. 2006. "How Do I Love Hash Tables? Let Me Count The Ways!" Proceedings of the Nineteenth Northeast SAS Users Group Meeting. Cary, NC: SAS Institute Inc.

Secosky, Jason, and Janice Bloom. 2007. "Getting Started with the DATA Step Hash Object." Proceedings of the First SAS Global Forum. Cary, NC: SAS Institute Inc.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    John Blackwell
    The Nature Conservancy
    4245 N Fairfax Drive ste 100
    Arlington, VA 22203
    jblackwell@tnc.org