IMPLEMENTATION OF A HASHING ROUTINE IN SAS SOFTWARE

Craig Ray, ORI, Inc.

1. INTRODUCTION

Hashing may be the fastest generalized technique for table lookup. Table lookup refers to the cross reference of a parameter file based on the value of a variable in the primary (main) file.

While the concept of hashing is well documented in computer science texts and is often implemented in third generation programming language applications, hashing has received very little attention in the SAS System. This may be due to alternative non-procedural techniques of table lookup available in the SAS System, such as MERGE and PROC FORMAT with the PUT function. (The term non-procedural means specifying what is to be done, not how to do it; therefore, non-procedural table lookup techniques require less programming.) While other techniques are available in the SAS System, they should not be used interchangeably. Depending on circumstances, one technique may be clearly superior to all others in terms of efficiency with respect to CPU time and "wall clock" time.

Hashing may not be extensively used by SAS users because the implementation of hashing in the SAS System is not readily apparent. This paper bridges that gap. Not only is the concept of hashing presented, but the code to implement the concept in the SAS System is explained.

This paper is intended to be an extension to the paper, A Comparison of Table Lookup Techniques (reference 1). In that paper, three techniques for table lookup were discussed and the applicability of each was compared.

2. DEFINITIONS

There are five terms that are used extensively throughout this paper:

1) Main File - the primary file of interest that is being processed one observation at a time.
2) Lookup File - the file that will be referenced for all or some of the observations of the main file in order to obtain auxiliary information.
3) Key - the variables in common between the main file and the lookup file. In most applications, the key is unique for each observation in the lookup file. This is not necessarily true for the main file. (Even if not explicitly stated, for all lookup methods, the key may consist of more than one variable; i.e., the key variables may always be concatenated into one variable.)
4) Result of Lookup - the variables needed from the lookup file.
5) Seek - one comparison in attempting to perform a lookup. For example, if a table lookup was successful after looking at three observations, then the lookup required three seeks.

For example, the data in Figure 1 represent a mailing list as the main file and a file with unique records for each zip code as the lookup file. The main file consists of each potential customer's name and his/her address. The lookup file is a separately maintained file of auxiliary information relating to every zip code. Table lookup attempts to relate each observation of the main file with the corresponding observation of the lookup file by utilizing the key variable. In this example, for each observation in the mailing list (main file), the corresponding observation in the zip code file (lookup file) is related by ZIP (key) to obtain the resulting number of piano tuners indicated by the value of TUNERS (result of lookup).

3. CLASSICAL TECHNIQUES FOR TABLE LOOKUP

To illustrate the effectiveness of hashing as a technique for searching, it is necessary to first review alternative methods of searching. Other classical methods of searching include sequential and binary search.

Conceptually, the technique most easily understood is the sequential search. In sequential search, each observation of the lookup file is examined in sequence, until the desired observation is found. If the lookup file has N observations, then the sequential search will require, on average, N/2 seeks. Obviously, except for very small lookup files, this method will not be very efficient.

Another classical method of searching is the binary search.
If the lookup table is sorted by the key of the lookup, a binary search will successively divide the range of observations to be searched in half until the desired observation is found. For a lookup file with N observations, the desired observation will be located, on average, in INT(LOG2(N)) seeks. For example, for a lookup table with 10 observations, the desired observation will be found on average in three seeks; for a lookup table with 1,000 observations, the desired observation will be found on average in nine seeks; and for a lookup table with 1,000,000 observations, the sought after observation will be found on average in 19 seeks. This is extremely fast when compared with the sequential search.
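The seek estimates above can be checked with a few lines of SAS. The following is a minimal sketch, not part of the original paper: it evaluates N/2 and INT(LOG2(N)) for the three table sizes quoted in the text. The variable names N, SEQ_SEEKS and BIN_SEEKS are chosen here for illustration only.

```sas
/* Average seeks per lookup, using the two estimates quoted in the text. */
data _null_;
   do n = 10, 1000, 1000000;
      seq_seeks = n / 2;           /* sequential search: N/2 on average    */
      bin_seeks = int(log2(n));    /* binary search: INT(LOG2(N))          */
      put n= comma9. seq_seeks= comma9. bin_seeks=;
   end;
run;
```

The binary search column reproduces the three, nine and 19 seeks cited above, while the sequential estimate grows in direct proportion to the size of the lookup file.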

The final classical method of searching and the subject of this paper is hashing. The method of hashing rearranges the observations in the lookup table such that the value of the key indicates where the observation is to be placed. If this rearrangement is done well, then the desired observation is found on average in less than 1.5 seeks. The number of seeks for a table with 1,000 observations will be approximately the same as for a lookup table with 1,000,000 observations. However, the lookup file must be rearranged (i.e., hashed) into an area approximately twice as large as the original lookup file for hashing to be efficient.

The use of binary search compared to hashing may be viewed as a trade-off between space and time. For binary search, the lookup file does not need any extra space as is necessary for hashing. However, each hashing lookup requires fewer seeks on average compared to the number of binary search seeks. Over the years, the computer industry has consistently made computer resources (e.g., disk and core) more plentiful, making hashing the more attractive alternative.

4. THE CONCEPT OF HASHING

To perform table lookup using hashing, an intermediate step to create a hash table from the lookup file is necessary. A hash table is constructed by performing an operation on the value of the key variable(s) of the lookup file. The result of this operation yields the address for each observation in the hash table (i.e., the observation number in the hash table SAS data set). To actually perform the lookup, the same operation is performed on the key variable(s) from the main file. This yields the address to look at in the hash table for the corresponding observation (i.e., like key values in the lookup file and main file will yield the same address).

In the simplest case, the key itself can be used as the address. In the example below, the lookup table, keyed on ZIP, maps into the hash table if ZIP is used as a pointer into the hash table.
Using the following SAS DATA step, the lookup may be performed for each observation in MAIN:

   SET HASHTBLE(KEEP=TUNERS) POINT=ZIP;

The lookup is successful in only one seek per observation in MAIN; however, this is at the expense of an extreme waste of space. A lookup file of merely three observations is rearranged into a hash table of 99,999 observations. The constructed hash table is depicted in Figure 2.

It is more prudent to use a function (i.e., "hash algorithm") of the key variable(s) as a pointer into the hash table, as in the example below. In this example, the MOD base 10 function is used on the key of each observation of the lookup file. The result of this function indicates the placement of the observation in the hash table. To execute the lookup, the same MOD base 10 function is performed on the key from the main file; the result of the operation is used to point to the hash table. This lookup step is illustrated by the following SAS DATA step:

   PTR = MOD(ZIP,10) + 1;
   SET HASHTBLE(KEEP=TUNERS) POINT=PTR;

In this case, the lookup would again be performed in only one seek per observation in MAIN; however, the hash table is considerably smaller than in the previous example. The constructed hash table is depicted in Figure 3.

Not all cases, however, can be expected to work this well. In particular, for any given function of the key, two or more observations from the lookup file will typically yield the same address. These are defined as collisions. If two or more observations "hash" to the same address, only one can go to that location in the hash table. The remainder are sent sequentially to an overflow table. Pointers are then maintained from the hash table to the overflow table. This is demonstrated in Figure 4. The lookup file contains four observations all yielding 3 as the address if the MOD base 10 function is used as the hash algorithm.
Arbitrarily, the last three are sent to the overflow table and the variables FIRST and LAST in the hash table point to those observations in the overflow table. When performing the lookup, if the hash algorithm for any observation in MAIN yields 3 as an address, the value of ZIP in MAIN is compared to the value of ZIP at observation 3 in the hash table. If they are equal, the lookup is successful in one seek. Otherwise, FIRST and LAST point to additional observations in the overflow table, whose hashed values also yielded the address 3. The overflow table is searched sequentially, between FIRST and LAST, until the sought after observation is found or all the observations pointed to are read.

To implement hashing, it is critical to be able to execute an operation on the value of the key variable(s) that yields an observation number. Fortunately, hashing is possible even if the keys are character values. It is only necessary to convert character representations to numeric values using the binary number equivalent of the ASCII or EBCDIC representation of the character string. (This conversion can be done using the DATA step INPUT function with an appropriate conversion format.) A numeric operation can be performed on this numeric equivalent.

Finally, for hashing to perform well, a good hash algorithm is required. If a poor hash algorithm is used, the net result may hardly be better than a sequential search. For example, if MOD base 1 is chosen as the hash algorithm, all observations of the lookup file would yield the same address. As a result, the hash table would contain only one observation; the remainder would be sent to the overflow table. This would simulate the sequential search.
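The effect of the MOD base on collisions can be seen on a tiny lookup file. The following is a minimal sketch under stated assumptions: the ZIP and TUNERS values are made up for illustration (they are not the values from the figures), and the data set name ADDRESSES and the three ADDR_ variables are hypothetical. It computes the address under MOD base 1, MOD base 10 and, purely for comparison, MOD base 7, then counts how many observations land on each address.

```sas
/* Made-up lookup file: every ZIP ends in 2, so MOD base 10 collides. */
data lookup;
   input zip tuners;
   datalines;
10012 20
22202 7
14602 15
20002 12
;
run;

/* Compute the candidate address under three different bases. */
data addresses;
   set lookup;
   addr_base1  = mod(zip, 1)  + 1;   /* every observation maps to address 1 */
   addr_base10 = mod(zip, 10) + 1;   /* every observation maps to address 3 */
   addr_base7  = mod(zip, 7)  + 1;   /* base 7 chosen only for comparison   */
run;

/* One row per address; a count above 1 means a collision. */
proc freq data=addresses;
   tables addr_base1 addr_base10 addr_base7 / nocum nopercent;
run;
```

With these particular values, base 1 and base 10 each place all four observations on a single address, so three of the four would go to the overflow table, while base 7 gives each observation its own address. Different data would of course behave differently.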

Ordinarily, the MOD function performs exceedingly well. The base of the MOD function determines the appropriate size of the hash table. It is recommended that this base be a prime number to avoid the possibility that an unusually large number of observations "hash" to the same address. To obtain reasonably fast searches (less than 1.5 seeks per lookup, on average), the hash table should be approximately twice the size of the lookup file. This can be adjusted depending on resource constraints. A larger hash table may be created by increasing the base of the MOD function if speed of the lookup is absolutely essential. The size of the hash table may be decreased at the cost of slower lookups if space is at a premium.

Many standard texts (see reference 2) on data structure provide a more thorough treatment of the subject of hashing.

5. IMPLEMENTATION OF HASHING IN THE SAS SYSTEM

The hash table should be stored as a SAS data set, which is stored on disk. This is the major difference between the implementation of hashing in traditional third generation programming languages and in the SAS System. In traditional languages, the hash table is typically stored in core as a series of parallel arrays. This difference has its trade-offs. Random access in core is much faster than random access from disk (using SET with the POINT option in SAS); the number of random accesses should be less than 1.5 on average per lookup if the hash table has been efficiently constructed. Typically, disk space is much more plentiful than space in core. Thus, space is not as serious a constraint in SAS, even on microcomputers, as it is in other languages.

The major obstacle to implementing hashing in SAS is creating the hash table. When creating the hash table, it is necessary to
calculate the observation number that indicates the location of each observation of the lookup file. There is no facility in SAS to "OUTPUT with a POINT option." A design that solves this problem is depicted in Figure 5. A preliminary DATA step performs the hash algorithm on the key variable(s) and stores this in the variable ADDRESS. The output data set, HASHVAR, is then sorted by ADDRESS. The last DATA step outputs two SAS data sets: HASHTBLE and OVERFLOW. The DATA step ensures that the observation number of HASHTBLE and the variable ADDRESS are equal. Where there is more than one observation in HASHVAR with the same value of ADDRESS, all but the last are sent to OVERFLOW. Where there are gaps in the values of ADDRESS in HASHVAR, blank observations are output to HASHTBLE. This ensures that ADDRESS (the result of the hash algorithm) actually corresponds to the observation number in HASHTBLE.

The code to implement hashing in the SAS System has been divided into two parts: creating the hash table, and performing lookups on the hash table. The code to create the hash table is contained in Figure 6. A preliminary DATA step creates a SAS data set, HASHVAR, that contains the hashing address. The second argument of the MOD function will generally change according to the number of observations in the lookup file. HASHVAR is then sorted by the calculated address and the generalized macro, %HASHTBLE, is called to create the hash table and overflow table. (Note, as coded in the macro %HASHTBLE, it is required that the variable containing the address actually be called ADDRESS.)

The code to actually perform the lookup on the hash table is contained in Figure 7. The programmer must set up the DATA step and perform the hash function on the key of the lookup and put the result in a variable, ADDRESS. (Note, the function must be exactly the same as the function used when creating the hash table; therefore, the function may
be placed in a macro that is called in both cases to ensure consistency.) The macro %HASHFIND is then called to search the hash and overflow tables. (Note: the code generated by %HASHFIND is only a portion of a DATA step and does not set up the DATA step.) When the code generated by %HASHFIND has finished, the program knows if the lookup was successful by comparing the variables KEY and TBLE_KEY for equality. Any programming statements may follow the call to macro %HASHFIND.

6. WHEN TO USE HASHING IN THE SAS SYSTEM

While hashing is the fastest generalized technique for table lookup, it must be placed in its proper perspective within the SAS System. The non-procedural techniques available, namely PROC FORMAT with the PUT function and MERGE, are more appropriate under certain circumstances; under other conditions, a SAS coded binary search would be most appropriate.

As a "rule of thumb", the most appropriate method for table lookup in the SAS System can be determined as a function of the number of observations in both the main and lookup files. This is depicted in Figure 8. SAS searches formats very rapidly, so that as long as the lookup table is not too large, PROC FORMAT with the PUT function is the best method. Otherwise, if the main file is rather small (i.e., very few lookups are required), then a SAS coded binary search is likely to be the best method. The overhead costs associated with just creating the hash table will be greater than the cost of a few binary searches of a sorted lookup table.

As can be seen, hashing is a competing method with MERGE. Both perform reasonably well with a fairly large number of observations in both the main and lookup files (e.g., 30,000 observations in both). MERGE, however, is easier to code. Using MERGE may require an extra DATA step. FIRST. and LAST. processing may not be possible using MERGE because the main file ordinarily must be resorted by the keys of the lookup file. In this case, hashing may be preferable because it does not require resorting the main file. Additionally, if the main file is very large (e.g., over 1,000,000 observations), then resorting the data set as is required by MERGE will be prohibitively expensive. In this case, hashing is likely to be the best technique.

In conclusion, hashing in the SAS System is not a technique to be used by programmers for "run-of-the-mill" table lookup applications. Rather, it is a tool to be employed for larger applications as circumstances dictate.

7. ACKNOWLEDGEMENTS

The author is indebted to Stephen Weiss, who helped to develop the approach for implementing hashing in the SAS System; Bob Pulgino, who volunteered use of his Apple Macintosh for the preparation of the slides used for presentation; and Tina Feggans for preparing this manuscript.

The author can be contacted at: ORI, Inc. Suite Indiana Avenue, N.W. Washington, D.C. (202)

References

1) Ray, Craig (1987), "A Comparison of Table Lookup Techniques", Proceedings of the Twelfth SAS Users Group International Conference, Cary, NC: SAS Institute Inc.
2) Flores, Ivan. Data Structure and Management, 2nd Edition, Prentice Hall, Inc.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.

FIG 1 Sample Data

MAIN FILE (variables: NAME, ZIP, POLITICAL PARTY) [ZIP values not recoverable]

   NAME               POLITICAL PARTY
   ADAMS, JOHN        R
   COLUMBUS, CHRIS    I
   COOPER, PAULA      R
   DALTON, JAMES      D
   DEBBS, EUGENE      S
   LORAN, NANCY       D
   MARX, KARL         C
   PORTER, ALAN       R
   SOBER, TOM         D
   THORPE, MARTHA     I

LOOKUP FILE (variables: ZIP, COUNTY, TUNERS) [ZIP values and one TUNERS value not recoverable]

   COUNTY       TUNERS
   MANHATTAN    20
   ARLINGTON
   MONROE       15

   KEY: ZIP
   RESULT OF LOOKUP: any combination of other lookup file variables

FIG 2 Create a Hash Table

Use the value of the key itself as an observation number.

[Diagram: Lookup Table (Obs, Zip, Tuners) mapped to Hash Table (Obs, Zip, Tuners)]

   SET HASHTBLE(KEEP=TUNERS) POINT=ZIP;
   RUN;

FIG 3 Create a Hash Table

Use a function of the key as an observation number. Example: use MOD base 10.

   OBS = MOD(ZIP,10) + 1

[Diagram: Lookup Table (Obs, Zip, Tuners) mapped to Hash Table (Obs, Zip, Tuners)]

   PTR = MOD(ZIP,10) + 1;
   SET HASHTBLE POINT=PTR;
   RUN;

FIG 4 Collisions

Hash Algorithm: OBS = MOD(ZIP,10) + 1

[Diagram: Lookup Table (Obs, Zip, Tuners) mapped to Hash Table (Obs, Zip, Tuners, First, Last), with pointers into Overflow Table (Obs, Zip, Tuners)]

FIG 5 Design for Creating Hash Table in SAS

   LOOKUP            Contains: KEY, Result of Lookup
      |
   DATA step         Perform hash algorithm on key
      |
   HASHVAR           Contains: KEY, Result of Lookup, ADDRESS
      |
   PROC SORT         BY ADDRESS
      |
   Sorted HASHVAR
      |
   DATA step         Create HASHTBLE and OVERFLOW
      |
   HASHTBLE          Contains: TBLE_KEY, Result of Lookup, ADDRESS, FIRST, LAST
   OVERFLOW          Contains: TBLE_KEY, Result of Lookup, ADDRESS

FIG 6

   DATA HASHVAR;
      SET LOOKUP;
      ADDRESS = MOD(KEY,2347) + 1;   /* + 1 KEEPS ADDRESS >= 1 FOR POINT= */
   RUN;

   PROC SORT DATA=HASHVAR;
      BY ADDRESS;
   RUN;

   %HASHTBLE

   %MACRO HASHTBLE;
   %************************************************;
   %* THIS MACRO OPERATES ON A SAS DATASET WHICH    *;
   %* HAS HAD ITS KEYS PUT THRU A HASH ALGORITHM    *;
   %* AND SORTED BY THE RESULT OF THE HASH          *;
   %* ALGORITHM. THIS MACRO THEN CREATES A SAS      *;
   %* DATA SET, HASHTBLE, WHICH CONTAINS ONE OBS    *;
   %* FOR EACH UNIQUE HASH ADDRESS. DUPLICATE       *;
   %* ADDRESSES ARE SENT TO SAS DATASET OVERFLOW.   *;
   %* POINTERS ARE THEN MAINTAINED FROM HASHTBLE    *;
   %* TO OVERFLOW. OBS IN HASHTBLE WITH NO          *;
   %* MAPPING ARE FILLED IN WITH KEY = MISSING      *;
   %* TO INDICATE NO FIND.                          *;
   %* INPUT DATA SET HASHVAR CONTAINS:              *;
   %*    KEY                                        *;
   %*    ADDRESS                                    *;
   %*    RESULT OF LOOKUP                           *;
   %* OUTPUT HASHTBLE CONTAINS:                     *;
   %*    TBLE_KEY                                   *;
   %*    ADDRESS                                    *;
   %*    FIRST & LAST (POINTERS TO OVERFLOW)        *;
   %*    RESULT OF LOOKUP                           *;
   %* OUTPUT OVERFLOW CONTAINS:                     *;
   %*    TBLE_KEY                                   *;
   %*    ADDRESS                                    *;
   %*    RESULT OF LOOKUP                           *;
   %* WRITTEN BY: CRAIG RAY, ORI, INC.              *;
   %************************************************;
   %PUT %STR( );
   %PUT NOTE: *** MACRO HASHTBLE HAS BEGUN;
   DATA HASHTBLE OVERFLOW(DROP=FIRST LAST);
      LENGTH TBLE_KEY $ 11;
      DROP OVEROBS HASHOBS KEY;
      RETAIN FIRST;
      SET HASHVAR;
      BY ADDRESS;
      TBLE_KEY = ' ';
      DO WHILE(ADDRESS > HASHOBS+1);   /* FILL GAPS WITH BLANK OBS */
         OUTPUT HASHTBLE;
         HASHOBS + 1;
      END;
      TBLE_KEY = KEY;
      IF FIRST.ADDRESS AND LAST.ADDRESS THEN
         DO; /* SINGLE MAP - NO OVERFLOW */
            OUTPUT HASHTBLE;
            HASHOBS + 1;
         END; /* SINGLE MAP - NO OVERFLOW */
      ELSE IF FIRST.ADDRESS THEN
         DO; /* SEND TO OVERFLOW AND INITIALIZE FIRST */
            OVEROBS + 1;
            FIRST = OVEROBS;
            OUTPUT OVERFLOW;
         END; /* SEND TO OVERFLOW AND INITIALIZE FIRST */
      ELSE IF LAST.ADDRESS THEN
         DO; /* OUTPUT TO HASHTBLE */
            LAST = OVEROBS;
            OUTPUT HASHTBLE;
            HASHOBS + 1;
            FIRST = .;
         END; /* OUTPUT TO HASHTBLE */
      ELSE
         DO; /* OUTPUT TO OVERFLOW */
            OVEROBS + 1;
            OUTPUT OVERFLOW;
         END; /* OUTPUT TO OVERFLOW */
   %* TESTPRNT IS A UTILITY MACRO THAT IS NOT LISTED IN THIS PAPER;
   %TESTPRNT(IN=HASHTBLE)
   %TESTPRNT(IN=OVERFLOW)
   %PUT %STR( );
   %PUT NOTE: *** MACRO HASHTBLE HAS FINISHED;
   %MEND HASHTBLE;

FIG 7

   DATA FIND;
      SET MAIN;
      ADDRESS = MOD(KEY,2347) + 1;   /* SAME FUNCTION AS IN FIG 6 */
      %HASHFIND
      /* CHECK KEY = TBLE_KEY TO TELL IF OBS FOUND */
      IF KEY = TBLE_KEY THEN OUTPUT;
   RUN;

   %MACRO HASHFIND;
   %************************************************;
   %* THIS MACRO PERFORMS THE ACTUAL LOOKUP ON A    *;
   %* HASH TABLE. IT ASSUMES THAT A VARIABLE        *;
   %* NAMED ADDRESS HAS BEEN CREATED CONTAINING     *;
   %* THE OBSERVATION TO BE REFERENCED IN SAS       *;
   %* DATA SET HASHTBLE. THE MACRO MAY THEN GO      *;
   %* TO SAS DATA SET OVERFLOW BASED ON POINTERS    *;
   %* IN HASHTBLE.                                  *;
   %*                                               *;
   %* WRITTEN BY: CRAIG RAY, ORI, INC.              *;
   %************************************************;
   %PUT %STR( );
   %PUT NOTE: *** MACRO HASHFIND HAS BEGUN;
   IF ADDRESS <= THASHOBS THEN
      DO; /* SEARCH HASHTBLE AND/OR OVERFLOW */
         SET HASHTBLE POINT=ADDRESS NOBS=THASHOBS;
         IF KEY NE TBLE_KEY AND TBLE_KEY NE ' ' AND LAST NE . THEN
            DO; /* SEARCH OVERFLOW */
               OVERPTR = FIRST;
               DO UNTIL(OVERPTR > LAST OR KEY = TBLE_KEY);
                  IF OVERPTR <= TOVEROBS THEN
                     DO; /* PERFORM SETS */
                        SET OVERFLOW POINT=OVERPTR NOBS=TOVEROBS;
                        OVERPTR + 1;
                     END; /* PERFORM SETS */
                  ELSE OVERPTR = LAST + 1; /* FORCE END OF LOOP */
               END;
            END; /* SEARCH OVERFLOW */
      END; /* SEARCH HASHTBLE AND/OR OVERFLOW */
   %PUT %STR( );
   %PUT NOTE: *** MACRO HASHFIND HAS FINISHED;
   %MEND HASHFIND;
   %PUT %STR( );
   %PUT NOTE: *** MACRO HASHFIND NOW LOADED;
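Section 5 notes that the hash function may be placed in a macro that is called both when the hash table is created and when the lookup is performed. The following is one possible sketch of that idea, not code from the original paper: the macro name %HASHADDR and its KEY= and BASE= parameters are hypothetical, the data sets are made up, and a numeric key is assumed. It only computes ADDRESS in both places; the resulting variable is what the macros in Figures 6 and 7 expect to find.

```sas
/* Hypothetical helper: a single definition of the hash algorithm. */
/* The + 1 keeps ADDRESS at 1 or above, as in the MOD base 10 case. */
%macro hashaddr(key=KEY, base=2347);
   ADDRESS = MOD(&key, &base) + 1;
%mend hashaddr;

/* Made-up lookup file with a numeric key. */
data lookup;
   input key tuners;
   datalines;
10012 20
22202 7
14602 15
;
run;

/* Made-up main file; the last key has no match in LOOKUP. */
data main;
   input key;
   datalines;
14602
10012
99999
;
run;

/* Creating the hash table: address each lookup observation, then sort. */
data hashvar;
   set lookup;
   %hashaddr(key=key, base=2347)
run;

proc sort data=hashvar;
   by address;
run;

/* Performing the lookup: address each main-file observation the same way. */
data main_addr;
   set main;
   %hashaddr(key=key, base=2347)
run;
```

Because both DATA steps call the same macro with the same base, a change to the hash algorithm is made in one place only, which is the consistency the text asks for.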

FIG 8 [Chart of the most appropriate table lookup method. Horizontal axis: Number of Obs. in Lookup File. Labeled regions: Sort/Merge, Hashing, Binary Search.]
