Gary L. Katsanis, Blue Cross and Blue Shield of the Rochester Area, Rochester, NY


Table Lookups in the SAS Data Step

Introduction - What is a Table Lookup?

You have a sales file with one observation for each order and an address file with one observation for each customer. To mail orders, you want to include the address in the sales file. You sort both files, then merge them by customer, and keep all observations with information from the sales file, as shown in Example 1. You have just performed a table lookup.

Table lookup describes the process of adding information from some other location to every observation in a data set. You add the information according to some rule, usually based on the value of one or more variables in your data set.

Example 1: Match Merge

    PROC SORT DATA=ORDER ;
      BY CUSTID ;
    PROC SORT DATA=ADDRESS ;
      BY CUSTID ;
    DATA BILLS ;
      MERGE ORDER (IN=KEEP) ADDRESS ;
      BY CUSTID ;
      * keep only observations that have information from ORDER ;
      IF KEEP ;
    RUN ;

The information you add needs to be stored in some form that the SAS System can use, such as a data set, format, or source code.

Table lookup is one of the most common operations you will perform when working with data stored in more than one file. Table lookups are so common for two reasons: first, your data may come to you from two separate sources; second, you might store data in separate data sets to avoid wasting space. In the example above, the sales file might have many observations and few variables, so adding the customer address could make the file many times larger.

Purpose of This Paper

Table lookup has received a lot of attention at SUGI in the past; (1), (2), (3) and (4) are only a few of the presentations on the subject. There has not been much change in techniques either, with the exception of the KEY= option, which I describe below. Craig Ray, Leonard Lutomski, and others have done an exceptional job of describing techniques and efficiencies, and I have very little to add.
However, everyone who uses the SAS System, or any database manager, uses table lookup techniques eventually. The purpose of this paper is to help these SAS users recognize when they need to perform a table lookup, and to provide enough information on the common techniques that they have a good chance of picking an effective one. Table lookup is not arcane, and every SAS user can benefit from knowing the basic techniques.

Table Lookup Terms

Table lookup is easier to discuss if we have a set of common terms. With minor modification, I am using the terms presented in SAS Language and Procedures, Usage 2 (see footnote 1).

Primary File - the data set to which you want to add information

Lookup Source - the location holding the information you want to add to the primary file; if this is a data set, it is called a lookup file

Key Variables - the variables that tell you which data from the lookup source you want to add to a particular observation in the primary file

Lookup Result - the information from the lookup source, specific to one or more observations in the primary file, that you want to add to the primary file

Seek Operation - the process by which you find the lookup result based on the information in the primary file

In Example 1, the primary file is ORDER, the lookup source is ADDRESS, the key variable is CUSTID, the lookup result is the customer's address, and the seek operation is the match merge performed by the MERGE and BY statements.

Footnote 1: Chapter 10 of SAS Language and Procedures, Usage 2 presents a thorough discussion of table lookup. It covers some techniques not presented here and is a very useful reference.

Cautionary Notes

Make sure key variable values are unique in the lookup file. Most table lookups are based on having only one answer to the seek operation. That is, there is only one observation in a lookup file for each value of the key variable, or only one appearance of each value of the key variable in your lookup source if the source is not a data set. I will assume this in my discussion. You should make sure that your table lookups follow this guideline, or you should program very carefully to work around it.

If you have more than one observation with the same value of the key variable and different values for the lookup result, you may face some unpleasant results. In a match merge, observations from the primary file may be duplicated, which can corrupt totals; moreover, the value of the lookup result could differ from observation to observation. In a binary search or an index search, you will not duplicate observations in the primary file, but you will not know which lookup result will appear in the data set you create.

Make sure that the only variables that appear in both the primary and the lookup file are the key variables. If the same variable appears in more than one data set, you will get one or the other in the data set you create, but you will not get both! In a match merge, you will get the one from the data set listed last, and in a binary search or an index search, you will get the one from the lookup file.
You should either rename or drop the variables from one of the incoming data sets (see footnote 2).

Footnote 2: You will find a discussion of this in Chapter 4 of the SAS Language Guide (6).

What Techniques Can You Use?

One of the first table lookup techniques that people master is the MERGE, as shown in Example 1. However, there is a whole series of techniques you can use in the DATA step to perform table lookups. I will give examples of several other techniques, then discuss when you would want to use them.

Using Formats as a Lookup Source

Let's say you were working with the sales file above, but rather than needing a customer's address, you need just the customer's name. You could create a format from the customer file that translates the customer's ID number to the customer's name, then call that format in a DATA step or a PROC step to use the customer's name, as shown in Example 2.

Example 2: Creating and Using a Format

    * create the table lookup format, using the CNTLIN option
      of PROC FORMAT ;
    DATA CNAMEFMT ;
      RETAIN FMTNAME "$TONAME" ;
      SET ADDRESS (KEEP=CUSTID NAME
                   RENAME=(CUSTID=START NAME=LABEL)) ;
    RUN ;
    PROC FORMAT CNTLIN=CNAMEFMT ;
    RUN ;
    * use the format to look up the name for each order ;
    DATA BILLS ;
      SET ORDER ;
      NAME = PUT(CUSTID,$TONAME.) ;
    RUN ;

In Example 2, I am using the CNTLIN option of the FORMAT procedure. This option lets you create a format directly from the contents of a SAS data set. You need to name the variables in your data set appropriately (FMTNAME, START, and LABEL) to use this feature. For more information on creating formats, see (7, 275-312). This process is much more efficient if you make $TONAME a permanent format. For more examples of formats used in table lookup, see (5, 197-204).

Using Binary Search of a Lookup File

If your lookup source is a SAS data set, you can use a binary search for table lookup. In the DATA step, binary search requires you to use the POINT= option on the SET statement (6, 485). POINT= instructs the SET statement to read observations from a file based on the sequence number of the observation. In a binary search, you look for the value of the key variable in the middle of ever-smaller ranges of observations, until you find it.

1. The first range is the whole file.
2. By looking at the middle observation in the range, you either (a) find the value you need, or (b) find whether the value you need is larger or smaller than the value at the middle.
3. If the value you need is larger than the value at the middle, your new range is the half of the old range with the larger values of the key variable.
4. If the value you need is smaller than the value at the middle, your new range is the half of the old range with the smaller values of the key variable.
5. You go back to step 2 until you find the value you were looking for.

A program that uses binary search appears in Example 3 at the end of this paper. Binary search requires the lookup file to be in order by the key variable, so it is much more efficient if your permanent copy of the lookup file is stored sorted by the key variable (2, 649). While you do not need to sort the primary file, my example runs much more efficiently if you sort it by the key variable as well.

Using an Indexed Search of a Lookup File

An alternative to using binary search for table lookup is to create and use an index based on the key variable. This is much easier to program than a binary search, because the seek operation is built in and you just call it by using the KEY= option on the SET statement (8, 43).
To build an index, you can use the DATASETS procedure, as in Example 4 (7, 264-265), or you can use the INDEX= data set option (8, 82-83).

Example 4: Creating an Index

    PROC DATASETS LIBRARY=BILLING ;
      MODIFY ADDRESS ;
      INDEX CREATE CUSTID ;
    QUIT ;

To use the index for table lookup in a DATA step, you use two SET statements: the first for your primary file, and the second, with the KEY= option, for your lookup file. The first SET statement assigns a value to your key variable, and the second uses that value to find the matching observation in the lookup file, as in Example 5.

Example 5: Using an Index

    DATA BILLS ;
      SET ORDER ;
      SET ADDRESS KEY=CUSTID ;
    RUN ;

There is no need to sort either file, although you can refine this technique by sorting the primary file and performing a seek operation only for the first observation for each value of the key variable.

Using Arrays as a Lookup Source

You could also use arrays as a lookup source. To do this, you need to find a way to identify the lookup result based on the key variable. The easiest way is to set up your array so that you get the lookup result when you use the value of the key variable as the array subscript. In Example 6, I am assuming that there are 50 customers, that CUSTID varies between 1 and 50, and that ADDRESS is stored in CUSTID order.

Example 6: Using an Array

    DATA BILLS ;
      LENGTH NAME1-NAME50 $30 ;
      * keep the array values from one observation to the next ;
      RETAIN NAME1-NAME50 ;
      ARRAY NAMES{50} NAME1-NAME50 ;
      * load the lookup file into the array on the first pass only ;
      IF _N_ EQ 1 THEN DO I = 1 TO 50 ;
        SET ADDRESS (KEEP=NAME) ;
        NAMES{I} = NAME ;
      END ;
      SET ORDER ;
      NAME = NAMES{CUSTID} ;
    RUN ;

This method is critically dependent on developing a way to perform the seek operation. In my example, I am assuming that the key variable has acceptable values for array subscripts. If this is not the case, you will need to either (a) create a format that maps the key variable to the array subscript you need, or (b) store the key variable in an array and use a sequential or binary search on it. See (5, 204-210) for an example using a format.

Other Techniques

There are other table lookup techniques. Blocks of hard-coded information, such as:

    WHEN (1) MONTH="January" ;
    WHEN (2) MONTH="February" ;
    ... etc. ...

are actually a form of table lookup. For small tables that change very infrequently, this is not necessarily a bad method.

There are also more specialized techniques, like hashing (3), the Golden Section Search (2), and use of macro variables (4). Each has its benefits but is less widely used than the techniques I presented above.

Which Technique Should You Use?

If you have not already spent some time looking into table lookup, you might be asking, "What's the matter with match merge? It works." The short answer is that there is nothing wrong with it. However, it might not be the fastest technique for your purpose.

Recently, I evaluated techniques my department uses to find names in a lookup file with 500,000 observations. Often, our primary file had only 100-200 observations. The technique we used was match merge. I found that the most efficient technique was indexing the file and using KEY= for our table lookup. Using an index saved 98.5% of the CPU time used by match merge.

Efficiency is often not an issue when you write a program that will be used once. When you develop applications for frequent use, you need to pay more attention to efficiency, and selecting the right table lookup can be a big part of it.

Leonard Lutomski (2) did extensive testing and benchmarking, and his findings may be useful when you choose a technique for table lookup. You may want to choose by testing each approach in your application and seeing which one performs best.
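One way to run such a test is to turn on the FULLSTIMER system option and compare the CPU and elapsed times that the SAS log reports for each candidate step. The sketch below is only an illustration under stated assumptions: the data set names ORDER_S, ADDRESS_S, BILLS1, and BILLS2 are hypothetical, NAME stands in for the lookup result, and ADDRESS is assumed to already have an index on CUSTID.

```sas
/* Sketch only: output and work data set names are hypothetical. */
OPTIONS FULLSTIMER ;   /* report detailed resource use in the log */

/* Candidate 1: match merge. Sorted copies are written with OUT=  */
/* so the indexed ADDRESS file is not sorted in place.            */
PROC SORT DATA=ORDER OUT=ORDER_S ;
  BY CUSTID ;
RUN ;
PROC SORT DATA=ADDRESS OUT=ADDRESS_S ;
  BY CUSTID ;
RUN ;
DATA BILLS1 ;
  MERGE ORDER_S (IN=KEEP) ADDRESS_S ;
  BY CUSTID ;
  IF KEEP ;
RUN ;

/* Candidate 2: indexed lookup with KEY= (assumes an index on     */
/* CUSTID already exists for ADDRESS).                            */
DATA BILLS2 ;
  SET ORDER ;
  SET ADDRESS KEY=CUSTID / UNIQUE ;
  IF _IORC_ NE 0 THEN DO ;
    _ERROR_ = 0 ;      /* no match is expected, not an error      */
    NAME = ' ' ;       /* NAME is the assumed lookup result       */
  END ;
RUN ;

OPTIONS NOFULLSTIMER ;
```

For a fair comparison, the sort steps count toward the cost of the match merge unless the files are already stored in sorted order, and the one-time cost of building the index counts toward the indexed lookup.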
I usually narrow my choices by excluding techniques that do not work well in my circumstances. If I have a critical application and more than one technique that can work well, I test them all. Otherwise, I just pick the easiest to use.

In Table 1, I have listed some issues that you should consider when you need to perform a table lookup. You need to look at your application and identify the issues, such as:

Do I have more than one key variable?
Does my lookup result contain more than one variable?
Are my primary and lookup files already sorted by the key variable?
Will I use most of my lookup file? Only a small part of my lookup file?

In my example with the 500,000 observation name data set above, I identified the following issues:

1. We had one key variable and one variable in the lookup result.
2. We used a very small part of the lookup file at any time.
3. We could rearrange observations in the lookup file any way we wanted.
4. We had space for an index.
5. The primary file was not always sorted.

The best use of match merge is when both files are already sorted by the key variable and you will use all of the lookup file, especially if you have more than one key variable or more than one variable in your lookup result. That is clearly not the case here.

Of the other techniques, I considered using a format first. Since there is only one key variable and one variable in the lookup result, two of the major limitations on using formats did not apply. However, the number of observations in our lookup file was well beyond the practical limits of using a format.

Since arrays cannot handle large lookup sources, I considered binary search and using an index. Both are effective with small primary files and large lookup files. Binary search requires significant programming, and an index requires significant space. Since we would use the table lookup frequently in many programs and since we had the space to create an index, using an index and SET with the KEY= option for table lookups was our best choice.

Conclusion

Table lookup is a technique that SAS users apply very frequently in their work without always consciously recognizing it as table lookup. By recognizing when you need a table lookup and learning about table lookup techniques, you can quickly choose among the techniques to develop efficient applications. If you develop an application where efficiency is critical, you can narrow down your choices, then test each table lookup technique for optimal performance in your circumstances.

References

1 Ray, Craig, "A Comparison of Table Lookup Techniques," SAS Institute, Inc., Proceedings of the Twelfth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 1987, pp 40-50.

2 Lutomski, Leonard S., "The Comparative Efficiency of Table Lookup Methods," SAS Institute, Inc., Proceedings of the Thirteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 1988, pp 648-653.

3 Ray, Craig, "Implementation of a Hashing Routine in SAS Software," SAS Institute, Inc., Proceedings of the Thirteenth Annual SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 1988, pp 1184-1190.

4 Ray, Craig and Howard Levine, "Efficient Use of Table Lookup Procedures," SAS Institute, Inc., Proceedings of the Eleventh Annual SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 1986, pp 718-723.

5 SAS Institute, Inc., SAS Language and Procedures: Usage 2, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1991, 649 pp.

6 SAS Institute, Inc., SAS Language Reference, Version 6, First Edition, Cary, NC: SAS Institute, Inc., 1990, 1042 pp.

7 SAS Institute, Inc., SAS Guide, Version 6, Third Edition, Cary, NC: SAS Institute, Inc., 1990, 705 pp.

8 SAS Institute, Inc., SAS Technical Report P222, Changes and Enhancements to Base SAS Software, Release 6.07, Cary, NC: SAS Institute, Inc., 1991, pp 20, 82-83.

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Table 1: Desirable and Undesirable Characteristics of Table Lookup Techniques

                                              Match           Binary  Index
Desirable Characteristics                     Merge   Format  Search  Search  Array

Is it simple to program?                      yes     yes     no      yes     no
Can it be added easily to an existing
  DATA step?                                  (a)     yes     no      yes     no
Is it easy to use lookup results with
  more than one variable?                     yes     no      yes     yes     no
Is it easy to use multiple key variables?     yes     no      no      yes     no

                                              Match           Binary  Index
Undesirable Characteristics                   Merge   Format  Search  Search  Array

Does it need a sorted primary file?           yes     no      no      no      no
Does it need a sorted lookup file?            yes     no      yes     (b)     no
Is it slow if you use a small part of
  a large lookup file?                        yes     no      no      no      yes
Is it slow if you use most or all of
  a large lookup file?                        no      no      yes     yes     (c)
Is there a practical limit on the size
  of the lookup source?                       no      yes     no      no      yes

(a) if files are already sorted
(b) no, but needs index
(c) depends on logic of seek operation

Other Comments

All Techniques: Sorting the primary file and performing the seek only when the value of the key variable changes can improve performance.

Formats: You can store the format in a permanent library to avoid overhead. If you have ranges of values for the key variable that should give the same lookup result, you can specify the range and save space.

Index: You can build more than one index for a single lookup file. Indexes take significant resources to create, and significant space to store.

Array: If your key variable is not a valid array subscript, you need to build an efficient seek operation, or the array technique will not give you good results; see above.
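The first comment above, seeking only when the key value changes, can be combined with the KEY= technique. The following is a minimal sketch, not a definitive implementation. It assumes that ORDER is sorted by CUSTID, that ADDRESS has an index on CUSTID, that NAME is the lookup result, and that NAME does not also appear in ORDER. It also uses the automatic variable _IORC_, which the SET statement with KEY= sets to a nonzero value when no matching observation is found.

```sas
/* Sketch only: assumes ORDER is sorted by CUSTID, ADDRESS is     */
/* indexed on CUSTID, and NAME exists only in ADDRESS.            */
DATA BILLS ;
  SET ORDER ;
  BY CUSTID ;
  /* Seek only for the first order of each customer. NAME is      */
  /* read by a SET statement, so its value is retained for the    */
  /* remaining orders of the same customer.                       */
  IF FIRST.CUSTID THEN DO ;
    SET ADDRESS (KEEP=CUSTID NAME) KEY=CUSTID / UNIQUE ;
    IF _IORC_ NE 0 THEN DO ;
      /* No match: clear the error flag and blank out the stale   */
      /* lookup result left over from the previous customer.      */
      _ERROR_ = 0 ;
      NAME = ' ' ;
    END ;
  END ;
RUN ;
```

Without the _IORC_ check, an order whose customer is missing from ADDRESS would silently keep the name found for the previous customer, which is the same kind of bad data that Example 3 guards against by setting the lookup result to missing.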

Example 3: Binary Search

    DATA BILLS ;
      * ORDER is the primary file and does not have to be sorted,
        but the step is more efficient if it is sorted by CUSTID ;
      SET ORDER ;
      * START and END are the numbers of the first and last
        observations in the range being searched, and POINTER
        points to the middle of the range. Observation numbers
        begin at 1 ;
      START = 1 ;
      END = CNUM ;
      POINTER = INT((END+START)/2) ;
      * if the last seek operation found the same key value, the
        seek loop does not run at all, and the lookup result from
        the last seek is used ;
      DO WHILE (CUSTID NE SEEKID) ;
        * START is greater than END only when the range is
          exhausted, so the key value is not in the lookup file ;
        IF START GT END THEN LEAVE ;
        * read the lookup file at the latest value of POINTER.
          Note that ADDRESS must be sorted by CUSTID ;
        SET ADDRESS (RENAME=(CUSTID=SEEKID) KEEP=CUSTID RESULT)
            POINT=POINTER NOBS=CNUM ;
        * if the key found is larger than the one wanted, use
          POINTER-1 as the END of the new range. If it is smaller,
          use POINTER+1 as the START of the new range ;
        IF CUSTID LT SEEKID THEN END = POINTER-1 ;
        ELSE IF CUSTID GT SEEKID THEN START = POINTER+1 ;
        * reset POINTER to the middle of the range ;
        POINTER = INT((END+START)/2) ;
      END ;
      IF CUSTID NE SEEKID THEN DO ;
        PUT 'customer not found ' CUSTID= ;
        * set the lookup result to missing, to avoid bad data, and
          clear SEEKID so a later order cannot match a stale key ;
        RESULT = " " ;
        CALL MISSING(SEEKID) ;
      END ;
    RUN ;
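Example 3 and the cautionary notes both assume that ADDRESS is sorted by CUSTID and holds only one observation per customer. A minimal sketch of one way to check this before running the lookup is shown below. The output names ADDRESS_SORTED and ADDRESS_DUPS are hypothetical, and the sketch assumes a SAS release in which PROC SORT supports the DUPOUT= option.

```sas
/* Sketch only: output data set names are hypothetical.           */
/* NODUPKEY keeps the first observation for each CUSTID, and      */
/* DUPOUT= collects every observation that was dropped.           */
PROC SORT DATA=ADDRESS OUT=ADDRESS_SORTED
          DUPOUT=ADDRESS_DUPS NODUPKEY ;
  BY CUSTID ;
RUN ;

/* List any duplicate keys so they can be reviewed by hand.       */
PROC PRINT DATA=ADDRESS_DUPS ;
  TITLE 'Customers with more than one ADDRESS observation' ;
RUN ;
```

If ADDRESS_DUPS is empty, the key values are unique and ADDRESS_SORTED can serve as the lookup file. If it is not empty, NODUPKEY has simply kept whichever observation came first, so the duplicates should be resolved deliberately rather than relying on that choice.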