Paper TT17 An Animated Guide : Speed Merges with Key Merging and the _IORC_ Variable Russ Lavery Contractor for Numeric resources, Inc.


ABSTRACT

The key merge (a.k.a. the _IORC_ merge) is an efficiency technique. It merges two files without first performing a slow, disk-space-consuming sort of the files. Because this merge does not require pre-sorting, it can be faster than a BY merge and can use less disk space. The _IORC_ merge is considered one of the "Table Lookup Techniques" and, as such, competes with BY merges, formats, IF/ELSE IF blocks, key-indexing, bitmapping, hashing and SQL joins. It is a useful technique for SAS programmers because it is fairly fast (faster than a BY merge) and easy to understand. Additionally, _IORC_ merging is part of the material in the SAS certification exam. This paper accompanies an animated presentation at NESUG and cannot duplicate the animated effect. It outlines the features of the _IORC_ merge as they were presented. More material can be found in the excellent articles in the online SUGI proceedings that are listed in the reference section of this paper.

INTRODUCTION

This paper explains the details of the _IORC_ merge and how the Program Data Vector (PDV) is modified as the _IORC_ merge executes. An understanding of the PDV is key to understanding this technique.

PROGRAM DATA VECTOR (PDV) FACTS

The PDV can be thought of as a data storage area. It functions much like a one-line Excel spreadsheet. The PDV has a column for every variable you read in from a data set, every variable you create in the data step and some automatic variables (_N_, _ERROR_ and _IORC_). When SAS processes a SET statement in a data step, it copies your data, ONE LINE AT A TIME, into the PDV. All calculations in your data step are performed in the PDV and the results of those calculations are stored in the PDV.
When all the statements in the data step have executed, the values in the PDV are written to the output file. If data comes into the PDV from a SAS data set (as opposed to cards or a text file), it is automatically retained until that data set is accessed again by a SET statement. If a data step accesses two SAS data sets, variables from both data sets are retained in the PDV until the SET statement associated with each data set next executes.

THE BUSINESS PROBLEM

Imagine you are working at a college and you send your assistant to the gym on the first day of class. He interviews people waiting in line to get their gym lockers and asks them if they are joggers. This information is recorded in a file called Day_. At the end of term your assistant goes to the health office in the gym and looks at the records of people who visited the health office complaining of either shin splints (a runner's problem) or tennis elbow (tennis elbow information is not required, but your assistant got carried away). This information is put in a file called UpDt. Note that the names in the two files do not match well. Also note that shin splints are coded S (shin splints) or N (no) and tennis elbow is coded T (tennis elbow) or N (no) (see Figure 1). Our merging goal is a file that contains all the people we interviewed on day 1, together with the information about their health problems that we collected at the end of term. Source data sets, code, the PDV and the output file are shown in Figures 1, 2, 3 and 4. These figures are also shown, in a larger size, in the appendix.

THE SAS CODE

The files we want to merge are of different sizes. The first step in an _IORC_ merge is to index the larger file. As you can see in Figure 1, the data step where the merge takes place has two SET statements. Count from the top and think of them as set1 and set2. Set1 executes first. The data set in set2 must have the KEY= option and an index.
The larger file should be indexed and placed in set2, the SET statement that has the KEY= option.
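An index on the lookup file can be built in more than one way. The following is a minimal sketch, assuming a small hypothetical UpDt data set (the rows are made up for illustration and are not the data on the slides). It shows two standard ways to create a unique simple index on Name and a PROC CONTENTS call that lists the indexes.

```sas
/* Hypothetical lookup data; names and values are illustrative only */
data UpDt;
   length Name $8 Sh_Sp $1 T_Elb $1;
   input Name $ Sh_Sp $ T_Elb $;
   datalines;
Sue N N
Fred S N
KL N T
;
run;

/* Option 1: PROC SQL builds a unique simple index.                 */
/* A simple index must have the same name as its key variable.      */
proc sql;
   create unique index Name on UpDt(Name);
quit;

/* Option 2: build the index while a data set is being created.     */
/* UpDt2 is a hypothetical copy used only to show the INDEX= option */
data UpDt2(index=(Name / unique));
   set UpDt;
run;

/* List the indexes to confirm they exist before running the merge */
proc contents data=UpDt;
run;
```

Which option is more convenient depends on whether the lookup file already exists (option 1) or is being built in the same job (option 2).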

The positions of the small and large data sets can be reversed (smaller file in set2) and the technique will still work; however, the job will run faster if the larger file is indexed and used in set2 (the SET statement with the KEY= option). You might code the indexing as:

   proc datasets lib=work;
      modify UpDt;
      index create name / unique;
   quit;

[Figure 1: the data sets Day_ (the smaller file, read by the first SET statement) and UpDt (the larger file, indexed with a unique index and read by the SET statement with the KEY= option), the data step code, and the PDV (Name, Run, Sh_Sp, T_Elb, _N_, _ERROR_, _IORC_) while the first observation is processed.]

The data step shown on the slide is:

   data new3;
      set Day_;                          /* (1) */
      set UpDt key=name / unique;        /* (2) */
      array setmiss(*) $ Sh_Sp--T_Elb;
      if _iorc_ then do;                 /* the "box" of code */
         _error_ = 0;
         _iorc_  = 0;
         do i = 1 to dim(setmiss);
            setmiss(i) = "";
         end;
      end;
   run;                                  /* (3) observation is output */

Figure 1 shows the first observation being processed. The statement data New3 creates the PDV and sets user variables to missing. At the top of the data step several things happen automatically. SAS sets _n_ = 1 and sets _error_ and _IORC_ to zero (_IORC_ is set to zero at the top of the data step ONLY for the first observation). The data step executes statements from the top down and executes the following statement (circle (1) in Figure 1):

   set Day_;

The above statement reads only the variables from the file Day_ into the PDV. At this time the PDV only contains values for Name and Run (Y). Sh_Sp and T_Elb contain missing values. Next SAS executes the statement (2):

   set UpDt key=name / unique;

This second SET statement performs an indexed table lookup inside UpDt. Since the option is KEY=name, SAS looks in the PDV and gets the current value of name. It then uses that value to perform an indexed lookup in the file UpDt, looking for an observation with the same value of name. When (and if) it finds such an observation, the second SET statement executes. The attempt to perform an indexed lookup and the copying of the information to the PDV are separate steps.
When the SET executes, it copies the values of the variables in UpDt from the data set into the PDV. Since the table lookup was successful, _error_ and _IORC_ stay at zero. Since _IORC_ is zero, the x-ed out box of code does not execute (Figure 1) for this observation. When SAS reaches the bottom of the data step it outputs the observation to the output file (circle (3) in Figure 1). The output file contains variables from both files. Figure 2 shows part of the processing of a "no-match" observation.

[Figure 2: the same data sets and code as Figure 1, with the PDV holding Name=Russ and Run=N from the second observation of Day_, together with Sh_Sp and T_Elb values retained from the previous observation.]

When control passes to the top of the data step, two automatic variables are modified. First, _n_ is incremented by 1. Second, while it is not easy to see here, _error_ is set to zero. The value of _IORC_ is not automatically modified at the top of the data step. SAS processes the first SET statement, the line marked with a (1) in Figure 2:

   set Day_;

This line reads the data from observation 2 of the file Day_ into the PDV. After it executes, the PDV contains the values "Russ" and "N" from the second observation in Day_. However, because data read from SAS data sets is automatically retained, the PDV still contains the information that came from the previous person's record in the file UpDt. Data step processing continues as shown below.

Figure 3 shows the execution of the second SET statement, the line marked with a (2) in Figure 3:

   set UpDt key=name / unique;

SAS looks in the PDV and gets the current value of name (Russ). It then attempts an indexed lookup in the file UpDt for an observation with name=Russ. When it fails to find an observation with name=Russ in UpDt, the SET does not execute. SAS sets _error_ to 1 and _IORC_ to a non-zero number. Dangerously, it does not reset the values of Sh_Sp and T_Elb to missing. The values of these variables have been retained from the previous observation, are not correct, and must be corrected manually. Since the value of _IORC_ is not zero, the box of code (circle (3) in Figure 3) executes.
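The danger of retained values can be seen directly in the log. The following is a minimal sketch, assuming two tiny hypothetical data sets in which Ann has a match in the lookup file and Russ does not (Ann, the Found flag and the Stale data set are illustrative names, not part of the original slides). The step clears _error_ and _IORC_ but deliberately does not repair Sh_Sp and T_Elb, so the stale values remain visible.

```sas
/* Hypothetical data: Ann matches the lookup file, Russ does not */
data Day_;
   length Name $8 Run $1;
   input Name $ Run $;
   datalines;
Ann Y
Russ N
;
run;

data UpDt(index=(Name / unique));
   length Name $8 Sh_Sp $1 T_Elb $1;
   input Name $ Sh_Sp $ T_Elb $;
   datalines;
Ann S N
;
run;

data Stale;
   set Day_;
   set UpDt key=Name / unique;
   /* Record whether the lookup succeeded, then clear the flags so  */
   /* the log stays quiet. The retained values are NOT repaired.    */
   Found   = (_iorc_ = 0);
   _error_ = 0;
   _iorc_  = 0;
   put Name= Run= Sh_Sp= T_Elb= Found=;
run;
```

Under these assumptions, the line written for Russ should still show the Sh_Sp and T_Elb values that were read for Ann, which is why the no-match branch has to reset them.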

[Figure 3: the same data sets and code, showing the failed indexed lookup for Russ. The PDV still holds the retained Sh_Sp and T_Elb values, and _ERROR_ and _IORC_ are non-zero.]

Figure 4 shows the effects of executing the box of code (circle (3) in Figure 4). The variable _error_ is set to zero (circle (1) in Figure 4) to suppress the printing of the error message in the log. The variable _IORC_ is set to zero (circle (2) in Figure 4) just to be tidy. This resetting of _IORC_ is not required for correct execution of the merge. We use array logic (circle (3) in Figure 4) to set all the variables that came from the indexed data set (UpDt) to missing. Note that the reset-to-missing logic would have to be a bit more complex if we had brought in a mixture of numeric and character variables from UpDt (a SAS array must be all numeric or all character). After all the code in the box has finished executing, the observation is copied to the output data set.

[Figure 4: the same data sets and code after the box of code has executed. _ERROR_ and _IORC_ are back to zero and Sh_Sp and T_Elb are missing for Russ.]
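When the lookup file supplies both numeric and character variables, one way to avoid maintaining two arrays is the CALL MISSING routine, which accepts a mixed list of variables. The following is a minimal sketch, assuming hypothetical data in which UpDt carries a character variable Sh_Sp and a numeric variable Visits (Visits, Ann and the data values are assumptions made for illustration only).

```sas
/* Hypothetical data: Ann matches the lookup file, Russ does not */
data Day_;
   length Name $8 Run $1;
   input Name $ Run $;
   datalines;
Ann Y
Russ N
;
run;

/* Lookup file with one character and one numeric variable */
data UpDt(index=(Name / unique));
   length Name $8 Sh_Sp $1 Visits 8;
   input Name $ Sh_Sp $ Visits;
   datalines;
Ann S 3
;
run;

data New3;
   set Day_;
   set UpDt key=Name / unique;
   if _iorc_ ne 0 then do;
      _error_ = 0;                 /* keep the log clean             */
      _iorc_  = 0;                 /* tidy, not strictly required    */
      call missing(Sh_Sp, Visits); /* works for mixed variable types */
   end;
run;
```

Listing the variables explicitly keeps the sketch simple; with many lookup variables, an array per type remains a reasonable alternative.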

The code in Figures 1 through 4 created a data set that has all the observations from Day_, regardless of the success of the matching attempt. If the match on an observation was successful, the output data set has values from both input data sets. If the matching attempt was not successful, the variables from UpDt have missing values on that observation. This output structure is often what a client wants.

SELECTING OBSERVATIONS IN BOTH FILES

The code would be slightly different if your goal were to create a data set that contains just the people who are in both files. That code is below. The pictures above can be used to examine the details of the logic.

   * Index larger file;
   proc datasets lib=work;
      modify UpDt;
      index create name / unique;
   quit;

   data new2;
      set Day_;
      set UpDt key=name / unique;
      if _iorc_ NE 0 then do;
         _error_ = 0;
         delete;
      end;
   run;

This code checks for a "failed index lookup" by checking _IORC_. If _IORC_ is not zero, the code resets _error_ to zero and deletes the observation.

USING AN _IORC_ MERGE WITH A COMPOUND INDEX

   proc sql;
      create table Small (name char(5), sex char(1));
      insert into Small
         values("pat","m")
         values("pat","f")
         values("sam","f")
         values("russ","m");
   quit;

   proc sql;
      create table ForCmpIndx (name char(5), sex char(1), age num);
      insert into ForCmpIndx
         values("pat","m",0)
         values("pat","f",4)
         values("sam","f",9);
      create index NmSex on ForCmpIndx(name, sex);
   quit;

   data Cmpnd;
      set Small;
      set ForCmpIndx key=nmsex / unique;
      if _iorc_ NE 0 then age = .;
   run;

The process is the same as before, but with a compound index. The resulting data set is:

   Obs   name   sex   age
   1     pat    m     0
   2     pat    f     4
   3     sam    f     9
   4     russ   m     .

Figure 5

As Figure 5 shows, a compound index is quite easy to use in an _IORC_ merge. Note that failures to find must still be fixed, as is shown for subject Russ.
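Testing _IORC_ against zero treats every non-zero value as "not found". A more defensive pattern is to test for specific return codes with the %SYSRC autocall macro, using the mnemonics _SOK (successful read) and _DSENOM (no matching observation). The following is a minimal sketch of that pattern with a compound index; the ages, the Matched flag and the Checked data set are assumptions made for illustration, and the sketch assumes the SAS autocall library is available, as it is in a default installation.

```sas
/* Hypothetical data; the ages are illustrative only */
data Small;
   length name $5 sex $1;
   input name $ sex $;
   datalines;
pat m
pat f
sam f
russ m
;
run;

data ForCmpIndx;
   length name $5 sex $1 age 8;
   input name $ sex $ age;
   datalines;
pat m 30
pat f 28
sam f 41
;
run;

proc sql;
   create unique index NmSex on ForCmpIndx(name, sex);
quit;

data Checked;
   set Small;
   set ForCmpIndx key=NmSex / unique;
   select (_iorc_);
      when (%sysrc(_sok)) Matched = 1;      /* lookup succeeded         */
      when (%sysrc(_dsenom)) do;            /* no matching observation  */
         _error_ = 0;
         Matched = 0;
         call missing(age);                 /* discard the retained age */
      end;
      otherwise do;                         /* any other I/O problem    */
         put "ERROR: unexpected _IORC_ value " _iorc_=;
         stop;
      end;
   end;
run;
```

The OTHERWISE branch stops the step instead of silently treating an unexpected I/O condition as a simple non-match.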

USING AN _IORC_ MERGE TO SELECT/UPDATE A VARIABLE IN A SPECIFIED ORDER

Imagine a business situation where you have files containing customer-reported address changes for this year (2008) as well as for 2007, 2006 and 2005. This might occur in a non-profit organization where the files are records of contributions. Management wants, for a select group of people, the most recent address. This can be done with an _IORC_ merge.

The client wants the most current address for these people:

   proc sql;
      create table GetThese (name char(9));
      insert into GetThese
         values("")
         values("Chee")
         values("Lahong")
         values("Murali")
         values("Yanmei")
         values("Russ");
   quit;

The four address files contain the following values (there is no address at all for Russ):

   Name     Curr Address   Address2007   Address2006   Address2005
   ""       AptCB          Apt7B         Apt6B         Apt5B
   Chee                                  Apt6C         Apt5C
   Lahong                  Apt7L                       Apt5L
   Murali                  Apt7M         Apt6M         Apt5M
   Yanmei                                              Apt5Y

Blank lines and lines with blank addresses are removed from each file. The current-address file is then sorted, and the 2007, 2006 and 2005 files are indexed.

Figure 6

Figure 6 shows the data files in a few ways. The SQL code shows the people for whom we want addresses. The yellow boxes show the information, by year, in a layout that makes it easy to see who is present in any particular year. The data files should be sorted, or indexed, before being fed into the _IORC_ merge.
The data step that performs the merge is:

   data MostCurrent;
      merge GetThese CurrAddr;
      by name;
      if CAddr in ("", " ") then do;          /* search 2007 file */
         set Addr07 key=name / unique;
         CAddr = Addr07;
         if _iorc_ ne 0 then do;              /* search 2006 file */
            _error_ = 0;
            set Addr06 key=name / unique;
            CAddr = Addr06;
            if _iorc_ ne 0 then do;           /* search 2005 file */
               _error_ = 0;
               set Addr05 key=name / unique;
               CAddr = Addr05;
               if _iorc_ ne 0 then do;        /* name not found */
                  _error_ = 0;
                  CAddr = "nomatch";
               end;
            end;                              /* end of 2005 do */
         end;                                 /* end of 2006 do */
      end;                                    /* end of 2007 do */
   run;

[Figure 6 (continued): the sorted GetThese file, the address files, the data step code and the PDV (Name, CAddr, Addr07, Addr06, Addr05, _ERROR_, _IORC_) after Chee's observation has been processed.]

Figure 6 shows the SAS supervisor after processing Chee's data. Data read from a SAS data set is automatically retained, so the PDV started out holding the previous person's data. Chee's name was read from the file GetThese, leaving the previous person's CAddr in the PDV. SAS tried to do a BY merge to get information on Chee from the CurrAddr (current-year address) file and failed, so CAddr in the PDV was set to missing. SAS then used an _IORC_ lookup to try to read Chee's address from Addr07 and failed, causing the variables _error_ and _IORC_ to become non-zero. Unhelpfully, the code then copied Chee's missing value for Addr07 into CAddr. The data step could have been coded to eliminate this, but the resulting code might not run faster and would not fit on a PowerPoint slide.

Since the value in _IORC_ is not zero, processing continues. _error_ is assigned a value of 0 for two reasons. First, it will not automatically be reset to zero by a later successful _IORC_ lookup. Second, if _error_ is not zero when control reaches the bottom of the data step, SAS will write a note to the log. If there are thousands of "no finds", the log will become unwieldy.

SAS then uses another _IORC_ lookup to read Addr06 and finds information on Chee in that file. _IORC_ is set to zero because of the successful read. This automatic reset to zero is in contrast to how SAS treats the variable _error_. A value is read into Addr06 in the PDV and then assigned to CAddr. Since the value of _IORC_ is zero, the X-ed out code does not execute.
[Figure 7: the same data sets and code as Figure 6, with the final output file. Note the automatic retains.]

   Name     CAddr   Addr07   Addr06   Addr05
   ""       AptCB
   Chee     Apt6C            Apt6C
   Lahong   Apt7L   Apt7L    Apt6C
   Murali   Apt7M   Apt7M    Apt6C
   Russ     nomat   Apt7M    Apt6C
   Yanmei   Apt5Y   Apt7M    Apt6C    Apt5Y

Figure 7 shows the final file. Note that CAddr has the correct information but that Addr07 and Addr06 have errors because of the automatic retaining of values read in from SAS data sets. These variables are not needed by the client and, in early versions of this paper, had been removed by a DROP option on the DATA statement. The DROP option was removed to allow these variables, and their values, to flow through to the final data set as an aid to understanding the internal process of this merge.

While the above merge is interesting, and useful when resources are limited, the same result could be produced with the simple BY merge shown below. In a merge, data sets to the right overwrite data sets to the left.

   proc sort data=GetThese; by name; run;
   proc sort data=CurrAddr; by name; run;
   proc sort data=Addr07;   by name; run;
   proc sort data=Addr06;   by name; run;
   proc sort data=Addr05;   by name; run;

   data EasyWay;
      merge GetThese
            Addr05(rename=(addr05=caddr))
            Addr06(rename=(addr06=caddr))
            Addr07(rename=(addr07=caddr))
            CurrAddr;
      by name;
   run;

CONCLUSION

Table lookup (i.e. merging two data sets, selecting observations from a large file that are also in a smaller file, or performing a long series of IF/ELSE IF tests) is a common SAS task, and SAS programmers should know the best ways to perform it. The _IORC_ merge is a fast way of selecting observations from a large data set. It does not require sorting of the data sets and thus conserves CPU time and disk space. For more details please read the excellent articles that are online at the NESUG and SUGI web sites and that are mentioned in the reference section. The articles by Sandra Aker were especially helpful to the author. Anyone who needs to perform table lookups quickly, and without sorting the data sets, should also investigate formats and hashing. There are several articles on table lookup using formats in the NESUG and SUGI proceedings. Hashing is a little more difficult to master than _IORC_ and format table lookups but can be very fast.
REFERENCES

Aker, Sandra, 1997, Table Look-up using Indexes, SQL, Arrays and formats without using Matched Merge Data Steps, In the Proceedings of the 1997 North East SAS Users Group Conference, page 3

Aker, Sandra, 1999, Using Indexes to Perform Table Look-up, In the Proceedings of the 1999 North East SAS Users Group Conference, page 79

Carpenter, Art, 2001, Table Lookups: From IF-THEN to key-indexing, In the Proceedings of the 23rd SAS Users Group International Conference, paper 58

Croonen and Theuwissen, 2002, Table Look-up: Techniques Beyond the Obvious, In the Proceedings of the 27th SAS Users Group International Conference, paper

Foley, Malachy J., 1997, Advanced MATCH-MERGING: Techniques, Tricks, and Traps, In the Proceedings of the 22nd SAS Users Group International Conference, paper 39

Gober, John, 1998, Understanding Indexed Datasets and Using Direct Access Queries, In the Proceedings of the 23rd SAS Users Group International Conference, paper 64

McAllister, Doug, 1998, Indexed Table Lookup vs Multiple Data Sets, In the Proceedings of the 1998 North East SAS Users Group Conference, page 40

Raffee, Dana, 1997, NO MORE MERGE: Alternative Table Lookup Techniques, In the Proceedings of the 22nd SAS Users Group International Conference, paper 88

Riba, David, 2002, Table Look-up Techniques Other Than the Matched Merge DATA Step, In the Proceedings of the 27th SAS Users Group International Conference, paper 27

Stinson, Walter, 2000, Indexing: My new best friend for table lookup, In the Proceedings of the 2000 North East SAS Users Group Conference, page 293

Zdeb, Mike, 2001, Five (or more) Alternatives for Record Selection From One File Based On Information In Another, In the Proceedings of the 2001 North East SAS Users Group Conference, page 355

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Russell Lavery, Contractor for Numeric Resources
Ardmore, PA
russ.lavery@verizon.net

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
