An Animated Guide: Speed Merges: Resource Use by Common Non-Parallel Procedures
Russ Lavery, Contractor for ASG, Inc.

ABSTRACT
This paper compares how resources are used by different SAS table lookup (Figure 1) techniques. It will be useful for programmers who are starting to study programming efficiency or who are experiencing efficiency problems with current programs. Understanding the resource use of basic processes gives programmers a logical structure for selecting the table lookup technique most likely to solve their efficiency problem. Advanced table lookup techniques all share one characteristic: they perform a Sorted By Merge (or something close to one) without sorting. The advanced table lookup techniques (Format, IORC, hashing, SQL, tagsort, modify, etc.) make different demands on I/O, CPU and disk space. Understanding those demands lets a programmer select the best technique for his/her particular situation.

Several years of NESUG/SUGI proceedings papers, including many on efficiency topics, are available online. These papers can be downloaded and read for an in-depth discussion of the table lookup techniques mentioned here. This paper is an overview. It hopes to help readers decide which table lookup techniques will best help them, and it suggests a web site that can be used to search for, and download, other useful proceedings papers.

INTRODUCTION
Figure 1: Concerned with FAST table lookup. Table lookup is our old friend the If statement:

    If zip= then state= PA ;
    Else if zip= then state= NJ ;
    Else state=?? ;

Table lookup is often done by accessing another file (by merge, format, IORC, hashing, etc.). Table lookup is also subsetting: a small file (Key, var1, var2, etc.) and a large file (Key, var1, var2, etc.) enter a DATA step, often a By Merge, and produce some logical result.

Figure 2: Efficiency trade-offs. Efficiency means using resources gently. Working for efficiency has many dimensions and involves trade-offs. Dimensions shown: CPU, memory utilization, I/O utilization, space, initial programming effort and maintenance programming effort. Typical complaints: "takes too long" and "file is too big".

A table lookup is the use of a key, or id, variable in a small file to look up values in a large file. This often takes the form of selecting a subset of the large file, as shown in Figure 1. The task is so common that efficient coding is valuable. The goal of this paper is to support good decision making when selecting an efficient table lookup technique. The paper discusses the details of common SAS processes and focuses on CPU usage and disk space requirements.

Efficiency is a multidimensional concept and usually involves making trade-offs (Figure 2). Programs that use little RAM often have excessive I/O. Programs that run quickly often have high CPU usage/requirements. Knowing how processes use resources helps programmers make those trade-offs. Generally, people study efficiency because a job takes too long to run or write, or creates a file that is too big for their system. Generally, we should use the most constrained resource gently.

The processes/techniques discussed in this paper are:
    Basic sequential SAS read of a data set
    Sorting: 1) General 2) SAS sort 3) Host sort 4) Asserted Sort 5) Ordered by
    The binary search process
    Formats
    Indexes: 1) General 2) _IORC_ 3) Tagsort
    Key Indexing, Bitmapping and Hashing

The paper contains a figure ranking the techniques and suggests a web site for identifying useful online articles.

THE BASIC SAS READ
When SAS executes a set or merge statement it reads the named file top-to-bottom. This top-to-bottom read (Figure 3) is a fast access technique for SAS. A top-to-bottom read is part of data step processing and is also how many Procs access data. Total run time is the time for a single read multiplied by the number of observations, and that total can be too high: SAS performs a fast operation (a sequential read) but performs one for every observation.

Figure 3: Your SAS sequential read. Reading a SAS file sequentially (first obs to last) is a very fast process (for SAS). SAS performs a read for each observation, but each individual read is fast.

    Data new;
      set old;
      tax=sales*.06;

Figure 4: Your SAS Sort. Sorting creates a sorting file that is approximately 2 1/2 times the size of the source file. Files shown: Work.Small, Work.Big, a copy of Work.Big, the sorting file, Misc. File 1 and Misc. File 2.

    Proc sort data=small noequals sortsize=max; by state zip;
    Proc sort data=big noequals sortsize=max; by state zip;

SORTING AND THE SORTED BY MERGE
Sorting is a common SAS technique and a required step for the Sorted By Merge table lookup. A Sorted By Merge is easy to program but is very CPU and disk intensive (Figure 4). When SAS sorts a file (say work.small) it creates a temp copy of the file AND a sorting copy that is about 3 ½ times the size of the original file. Sorting of observations is done in the sorting copy and, if successful, the sorted observations are written back to the temp file and then to work.small. After work.small is overwritten, the sorting and temp files are automatically deleted and disk/memory space is freed. It is the large size of the temp files (3 ½ times the size of the original file) that sometimes causes Proc Sort to fill up a disk and fail. The sortsize=max and noequals options should be used as sort options whenever possible.
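As a rough illustration of the Sorted By Merge discussed above, the sketch below sorts two tiny, made-up data sets and merges them by key. The data set names (small, big, lookup), the variables and the sample values are assumptions chosen for illustration; only the sort options come from the text.

```sas
/* Hypothetical sample data: a small lookup file and a larger file */
data small;
  input zip $ state $;
  datalines;
19003 PA
08540 NJ
;
run;

data big;
  input zip $ sales;
  datalines;
08540 250
19003 100
30301 75
19003 40
;
run;

/* Both files must be in key order before the BY merge */
proc sort data=small noequals sortsize=max;
  by zip;
run;

proc sort data=big noequals sortsize=max;
  by zip;
run;

/* Keep only the observations in big that have a match in small */
data lookup;
  merge big(in=in_big) small(in=in_small);
  by zip;
  if in_big and in_small;
run;
```

The two Proc Sort steps are where the CPU and disk costs described above are paid; the merge itself is a pair of sequential reads.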
If datasets are small (several thousand obs), the Sorted By Merge is recommended because it is so easy to program and its hardware demands (CPU and disk) are easily met by modern equipment. With small files, the preliminary sorting and the top-to-bottom read in the data step take so little total run time that there is little incentive to explore advanced table lookup techniques. With large files, programmers can experience time and disk space problems, and in that case Sorted By Merges should be avoided. Sorting is a resource intensive operation, taking lots of time and lots of disk space. Sorting large files has crashed systems.

SAS has done major reprogramming on Proc Sort in V9.0 / V9.1. The new Proc Sort automatically multithreads and accesses multiple CPUs when they are available. Results of some simple performance tests are available in a paper by Jacobson and Lavery in this NESUG.

Resource demand for Sorted By Merge: CPU: HIGH DISKSPACE: HIGH I/O: HIGH

THE SAS ASSERTED SORT
The fastest way to sort data is to have the data supplier deliver it in sorted order and simply tell SAS to treat the data as sorted. This is the Asserted Sort. It takes zero time and is shown in Figure 5. Companies sometimes struggle to sort data that is already sorted, or that could be requested already sorted from suppliers.

The sortedby= data set option can be applied on the data statement or the set statement. The box in the lower left of Figure 5 shows what Proc Contents would report about data set one in Figure 5. Proc Contents says the data is sorted but reports that SAS has not validated the sort order. If the data had been sorted by a Proc Sort, the Validated characteristic would be YES. SAS will create, and use, the first-dot and last-dot variables on data that is asserted to be sorted.

Resource demand for Asserted Sort: CPU: NONE DISKSPACE: NONE I/O: NONE

Figure 5: SAS Asserted Sort. If you know your data is sorted, you can assert that it is sorted and SAS will believe you. You can assert when you create the data set or when you use it. Great time saver!! No work done!! Proc Contents shows: Sortedby: zip, Validated: NO.

    data one /*(sortedby=zip)*/;
      infile datalines;
      input zip $char5.;
      datalines;
    ;
    data two;
      set one (sortedby=zip);
      by zip;
      if first.zip and last.zip;
    proc print data=two;

Figure 6: Select SAS vs. Host Sort. Sometimes the operating system has a very fast sort. You can specify the SAS sort, the host OS sort, or the best sort. Conventional wisdom says the SAS sort is good for small files (test and set your cutpoint). Watch out!!! There can be differences between SAS and host sort logic.

    Options sortcutp= xx ;
    Options sortpgm=sas;
    Options sortpgm=host;
    Options sortpgm=best;

HOST SORT VS SAS SORT
The word on the street is that the SAS V8.2 Proc Sort is very good for small and moderate data sets but is a bit slow if the data set is large. Unix and some mainframes have sorting routines that can be faster than SAS Proc Sort. SAS can be instructed to use a custom/host operating system sorting routine rather than its own. Options are shown in Figure 6. A programmer can instruct SAS to use SAS Proc Sort, to use a host sort, or to use the best sort method (the host sort if the file is above a certain size).

One danger of using a host sort is that it can use a different sorting sequence from SAS. Files would then be sorted one way if they are smaller than the cutpoint (and therefore sorted by the SAS sort) and a different way if they are larger than the cutpoint (and therefore sorted by the host sort). Resource utilization varies with hosts.

SAS for MS Windows, before version 9, cannot support a host sort. It accepts the commands to pass the sorting task to a host sort package, without issuing notes or warnings, but does not do anything.

Resource demand for Host Sort: CPU: ?? DISKSPACE: ?? I/O: ??
THE BINARY SEARCH PROCESS
The binary search process underlies several SAS procedures. While it seems very simple, it has been shown mathematically to be the optimal search method under conditions common to many SAS programming tasks. Binary searches are used by formats and indexes. A binary search searches an ordered file by repeatedly dividing it in half. Figure 7 shows the binary search process applied to a format. The format records are ordered in the format catalog and SAS finds a format entry by repeatedly applying simple rules. In Figure 7 we check whether subject 10 was assigned to the test or control treatment.

Figure 7: Formats use binary searches. SAS uses binary searches to look within formats and indexes. Searching a file by halves, let's look for subject 10:
    1) See if the subject number is in the exact middle of the file.
    2) If it is not, is it above or below the middle of the file? Define the range.
    3) See if the subject number is in the exact middle of the range defined above.
    4) Repeat.

    TorC= PUT(PAT_ID,InSml.);

Format file in Figure 7 (subject, treatment):
    001 test      002 control   003 test      004 control   005 control   006 control
    007 test      008 control   009 test      010 control   011 test      012 test
    013 control   014 test      015 control   016 test      017 control

Figure 8: Format syntax, a data set, and the output printed using the formats.

    proc format;   /* creates THREE formats */
      value skill LOW -< 1   = "BAD # LOW"
                  1          = "SAS"
                  2          = "Java"
                  3,5,6      = "Microsoft"
                  6 <- HIGH  = "BAD # HIGH";
      value $Gen_Age "M"         = "A.M."
                     "F"         = "A.F."
                     "C","B","A" = "Non-Adult"
                     other       = "error";
      VALUE $WHR_IN /* CHAR RANGES */
                    low -< "00000"    = "bad zip"
                    "19000" - "19099" = "PHILA"
                    "19100" - "19400" = "pa"
                    "80000" - "89999" = "JERSEY"
                    other             = "UNKNOWN";
    PROC PRINT DATA=EX_2;
      FORMAT GENDER $Gen_Age. JOB SKILL.;
    RUN;

Data Set Ex_2:
    NAME  GENDER  JOB
    Bob   M       1
    Russ  M       2
    Sue   F       2
    AJ    M       1
    Dot   F       2

OUTPUT (prints using the format):
    NAME  GENDER  JOB
    Bob   A.M.    SAS
    Russ  A.M.    Java
    Sue   A.F.    Java
    AJ    A.M.    SAS
    Dot   A.F.    Java
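The halving procedure of Figure 7 can be sketched as a short DATA step. This is only an illustration of the idea, not how SAS implements format searches internally: the temporary arrays, the variable names (target, lo, hi, mid, probes, result) and the output data set name are assumptions, and the subject/treatment values are copied from the Figure 7 listing.

```sas
data lookup_result (keep=target result probes);
  /* Subjects 1-17 and their treatments, in sorted order (Figure 7) */
  array subj{17} _temporary_ (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17);
  array trt{17} $ 7 _temporary_
    ("test" "control" "test" "control" "control" "control" "test" "control"
     "test" "control" "test" "test" "control" "test" "control" "test" "control");
  length result $ 7;

  target = 10;        /* the subject we are looking for */
  lo     = 1;         /* first position still in play   */
  hi     = dim(subj); /* last position still in play    */
  found  = 0;
  probes = 0;

  do while (lo <= hi and not found);
    mid    = floor((lo + hi) / 2);   /* middle of the current range */
    probe  = subj{mid};
    probes = probes + 1;
    put "Probing subject " probe;
    if probe = target then found = mid;
    else if probe < target then lo = mid + 1;  /* keep the later half   */
    else hi = mid - 1;                         /* keep the earlier half */
  end;

  if found then result = trt{found};
  else result = "?";
run;
```

With integer midpoints this sketch probes subjects 9, 13, 11 and 10, so the exact midpoints can differ slightly from those drawn in the figure, but the number of probes is the same.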

SAS picks the middle observation in the file (subject 9) and asks if that is the subject it was seeking. If it is the desired subject, the search stops. If it is not, SAS asks if the desired subject sorts above, or below, the current number in the listing. It is below (further down the listing). SAS then divides the file in half and only considers subjects from 9 to 17. It then picks a number in the middle of the new range (14) and asks if that is the desired subject. If it is, the process stops. It is not. SAS asks if the desired subject is above, or below, the current number. It is above. SAS then divides the range in half and only considers subjects from 9 to 14. It picks a number in the middle of the new range (11) and asks if that is the desired subject. If it is, the process stops. It is not. SAS asks if the desired subject is above, or below, the current subject. It is above. SAS then divides the range in half and only considers subjects from 9 to 10. It picks a number in the middle of the new range (10) and asks if that is the desired subject. Since it is, the process stops.

Figure 7 shows this binary search in a format file. Information is transferred from the format to the PDV by a put (or input) function.

FORMATS
SAS formats automatically convert data values to the formatted value when data is displayed, and are created with a Proc Format. They take time to create and disk space after they are created. They can be re-used, so the cost of creation needs to be paid only once, but no automatic maintenance is performed on formats. Figure 8 shows the normal creation, and typical use, of character and numeric formats. The large box on the left contains the syntax for a Proc Format and a Proc Print that applies those formats. A data set is in the upper right hand corner and the output of the Proc Print, applying the formats to that data set, is in the lower right.
Formats are very useful SAS tools for saving disk space and time. Generally, formats determine how data is displayed. Formats can save disk space because the data in the SAS data set is stored as the original value (in Figure 8 the original values are 1, 2, M, F) and then displayed in a longer form. In Figure 8, the savings would be substantial if the zips were stored as zips (19101) and then expanded to much longer strings/values like Main Postal Distribution Center, Philadelphia, Pennsylvania.

Format table lookup does not require sorting of either data set. The large data set is read from top to bottom and the format is applied for each observation. A large portion of the speed of the format table lookup comes from the fact that it is usually a memory resident technique and avoids disk access. Theoretically there is no limit on the number of levels in a SAS format. However, when the format table lookup executes, the whole format must fit in RAM or performance suffers. Some operating systems cannot page formats and will crash if the format is larger than the available RAM. Some operating systems will page large formats between disk and RAM, allowing the job to complete but increasing run time.

SAS code that would select, from a very large file, patients that had been assigned to be controls (imagine the file in Figure 7 is the format InSml) is shown below. This is a Format Table Lookup.

    Data subset;
      Set very_big_file;
      If put(pat_id,insml.)="control";
    Run;

Formats are stored in a catalog (permanent or work) and take RAM/disk space (Figure 9). When formats are created from an input file (see the right hand box in Figure 9), the data from the input file is summarized in a file (often called cntlin). Cntlin is used as input to a Proc Format. This cntlin file must be manually removed from work to release space.

While formats are often used for table lookup, they were not designed with this task in mind. As a result, formats perform extra processing that is not required for the simple task of table lookup.
The Format Table Lookup technique is best for a table lookup, or a merge, when only one variable in the small file must be brought through the merge. Remember, the PDV is created by the set statement that accesses the larger file and contains that file's variables (plus SAS automatic variables). While several variables from the small file can be brought through a merge using a Format Table Lookup, it is not often done. The additional steps needed to concatenate the variables into one label variable, and then substring the variables back out after the merge, mean that this twist on the basic technique is infrequently used.

When the Format Table Lookup fails to run, or there is a need to bring several variables from the small file through the merge, a programmer often tries an _IORC_ Merge. An _IORC_ Merge is based on a SAS index and is not a RAM resident technique. It is generally slower than a Format Table Lookup, but often faster than a Sorted By Merge.

Resource demand for Format Table Lookup: CPU: HIGH DISKSPACE: LOW I/O: MODERATE
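The concatenate-then-substring twist mentioned above might look something like the sketch below. Everything here is an assumption made for illustration: the data sets (small, big, merged), the variables (treat, site, visit), the format name carry, the fixed 8-byte width for each packed variable and the NOMATCH marker for keys that are not in the small file.

```sas
/* Hypothetical small file: two variables to carry through the lookup */
data small;
  input pat_id treat $ site $;
  datalines;
101 test east
102 control west
;
run;

/* Hypothetical large file */
data big;
  input pat_id visit;
  datalines;
101 1
102 1
103 1
101 2
;
run;

/* Build a CNTLIN data set whose label packs both variables, 8 bytes each */
data cntlin (keep=fmtname type start label hlo);
  length label $ 16 hlo $ 1;
  retain fmtname "carry" type "n";
  set small end=last;
  start = pat_id;
  label = put(treat, $8.) || put(site, $8.);
  output;
  if last then do;          /* extra row: what to return for all other keys */
    hlo   = "O";
    start = .;
    label = "NOMATCH";
    output;
  end;
run;

proc format cntlin=cntlin;
run;

/* Apply the format, keep matches, then unpack the two variables */
data merged (drop=packed);
  set big;
  length packed $ 16 treat site $ 8;
  packed = put(pat_id, carry16.);
  if packed ne "NOMATCH";
  treat = substr(packed, 1, 8);
  site  = substr(packed, 9, 8);
run;
```

The extra bookkeeping (fixed widths, packing and unpacking) is the cost the text refers to; with only one carried variable the label can be used directly.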

Figure 9: Your SAS Format. Formats require creation of a cntlin file and storage of the format in the catalog. Files shown: Work.Small, Work.Big, the cntlin file, the format, Misc. File 1 and Misc. File 2.

    data CNTLIN (keep= fmtname start label type hlo);
      length label $ 5;   /* room for "other" */
      retain FMTNAME "InSml" TYPE "n" LABEL "YES";
      set small (rename=(pat_id=start)) end=last;
      output;
      if last=1 then do;
        Hlo="O"; label="other"; start=.; output;
      end;
    proc format cntlin=cntlin;

    DATA N_SML_TOO;
      SET BIG;
      IF PUT(PAT_ID,InSml.)="YES";
    RUN;
    PROC PRINT; RUN;

Figure 10: Your SAS Index. Indexes are from 10% to 50% of the size of the source file. Indexes take time to create! Individual reads are slow! Indexes are good: 1) if they can be reused and 2) if you only want 5% to 10% of the big file. Files shown: Work.Small, Work.Big, the small file index, the big file index, a copy of the big file, Misc. File 1 and Misc. File 2.

    data small(index=(state));
      set small;

    Proc sql;
      create index state on work.small(state);
    quit;

    Proc datasets lib=work;
      Modify Large;
      index create zip;
    quit;

INDEXES: THE BASIS OF SEVERAL MERGE TECHNIQUES
Programmers can use several methods to create an index. Indexes take time to create and disk space after creation. An index can be re-used, so the cost of creation needs to be paid only once. However, indexes are automatically maintained: as observations are added and dropped, resources are used in maintenance.

Generally, indexes take more time to retrieve data than formats because index use involves more steps than the use of a format, and some of the additional steps are slow. As a first step in data access, an index goes through the same binary search as a format. However, a successful find does not return the desired information (Figure 7) but rather a location on the hard drive where the information can be found (Figure 11). Reading the information involves a slow (mechanical) process. The location is passed to the disk controller with an instruction to find that page of data on the hard drive. The disk spins, the disk arm moves, and that page is read.
The page of information is passed back to the CPU. The page contains several observations and the CPU parses the information to select the correct observation and variable. Any mechanical process is slow and to be avoided if possible.

For table lookup, the index is typically built on the larger file. The smaller file is processed from top to bottom and the index is accessed for every observation in the smaller file. Indexes are attractive if they return a small fraction (less than 5%) of the large file. If an index table lookup returns 20% or more of the large file, there are usually faster techniques to employ.

THE _IORC_ MERGE
The _IORC_ merge does not require sorting data sets but does require an index on the larger data set (Figures 12 & 13). Typically, the smaller file is read top-to-bottom and an indexed lookup is attempted for every observation. Figure 12 shows that the first set statement has executed and loaded the variables from that data set (Day_1) into the PDV. The second set statement (with the key= option) uses the value of name, on the PDV, to do an indexed read of the data set UpDt. In Figure 12 the lookup is successful and variables from UpDt are copied to the PDV. A successful index lookup results in a 0 being written to _IORC_ on the PDV.
Figure 11: Reminder: indexes use binary searches. The index file holds subject, page and row. Once subject 9 is found in the index, accessing the information is a multi-step process: the index returns the location of a page and row of data on the disk drive; the controller is told to find the page; the read head moves over the spinning drive; the page/block of data (overhead, Obs 8, Obs 9, Obs 10, 1 byte per obs, unused space, descriptor) is read back to the CPU; and the data is parsed to find the proper observation and field. An index lookup involves more operations than a format lookup.

Figure 12: A successful _IORC_ lookup.

    data new3;
      set Day_1;
      set UpDt key=name/unique;
      array setmiss(*) $ ShinSpl -- T_Elb;
      if _iorc_ then do;
        _error_=0;
        _IORC_=0;
        do i= 1 to dim(setmiss);
          setmiss(i)="";
        end;
      end;

    Data set Day_1 (Name, Run): Bob Y, Russ N, Sue N, AJ N, Fred Y, Glenn N, KL Y
    Data set UpDt (Name, Sh_Sp, T_Elb), as listed in the figure: Bob S N, Eric S N, Sue N N, Fred S, Mark S, Walt N, KL N T, Wayne N T, Sally N T
    Data vector (Name, Run, Sh_Sp, T_Elb, _N_, _ERROR_, _IORC_): Bob Y S N
    Output file: Bob Y S N

An unsuccessful attempt to do a table lookup is shown in Figure 13. When the second set statement fails to find a match in the index lookup, it writes a non-zero value to _IORC_. The non-zero value of _IORC_ causes the do group to execute, which resets the looked-up variables in the PDV to missing. Note: variables from both data sets are on the PDV and in the output data set.

Figure 13: An unsuccessful _IORC_ lookup. The code and data sets are the same as in Figure 12. The data vector holds Russ N with the looked-up variables set to missing. The output file now holds:
    Bob Y S N
    Russ N

Figure 14: Your SAS Tagsort. Using a tagsort involves searching in the sorted secondary file and an indexed lookup using the pointer. Tagsorts create a secondary file containing the sort keys (state, zip) and a pointer, plus a sort file for the secondary file. Files shown: Work.Small, Work.Big, the sorted secondary file, Misc. File 1 and Misc. File 2. Proc Contents shows: Sortedby: zip, Validated: YES.

    proc sort data=big tagsort;
      By state zip;

Resource demand for IORC Table Lookup: CPU: HIGH DISKSPACE: LOW I/O: HIGH

THE TAGSORT
A tagsort (Figure 14) involves the creation of a secondary file containing the key and a pointer to the location of the observation on the disk (like an index). Tagsorts are easy to program and can be used to support a Sorted By Merge without sorting a file. The tagsort process is similar to an automated IORC merge in that the basic process is an index lookup. It has all the problems associated with index usage if the small file has more than 5% of the number of obs in the large file. This can be a very slow process, and it is not often used today. It reads the source file top to bottom and creates a temp file. It then sorts the temp file. Finally it accesses the main file again with a random access method. While slow, it does avoid sorting the main file and can have some value if the problem is lack of disk space.

Resource demand for Tagsort Table Lookup: CPU: HIGH DISKSPACE: LOW I/O: HIGH

KEY INDEXING, BITMAPPING AND MANUAL HASHING
Key indexing, bitmapping and manual hashing are only available inside the data step.
All desired information from the small file is loaded into a SAS array. The larger file is read from top to bottom and SAS performs a lookup for each observation in the large file. This process is illustrated, at a high level, in Figure 15. In key indexing, bitmapping and manual hashing, a mathematical function reads the key variable from the PDV as the large file is processed. SAS uses the value of the key variable to calculate the array cell number that holds the desired information. Retrieving information from the array is quick. Typically, only one variable is carried through a hash-merge. More can be carried through, at the expense of substringing the information that had been stored in the array. This is not often done, since the SAS substring function is not particularly fast.

Key indexing, bitmapping and hashing have been described in a series of articles written by Dr. Paul Dorfman. These are very fast table lookup techniques because they load the whole small data set into standard SAS arrays, and those arrays must exist in RAM. These very helpful techniques can be difficult to code and are generally a challenge for the maintenance programmer who inherits code that uses them.

Resource demand for Key Indexing, etc. Table Lookup: CPU: HIGH DISKSPACE: VERY LOW I/O: LOW

V9 HASHING
In response to the challenge of manually coding hashing algorithms, SAS has created an easy-to-use hashing applet in V9. The applet can be called from within the data step and creates something like a very smart array. Hashing is production in V9 (and higher) and preliminary speed tests show it to be a fast technique. To get maximum speed from the technique, the whole small file should be memory-resident. If an OS does not page memory objects, SAS may crash if the hash object does not fit in RAM. V9 hashing has a new file structure and searches it using a two-part algorithm.
The file structure/algorithm is one reason that hashing is generally faster than formats for table lookup. An additional factor is that the hashing algorithm was designed to do table lookup and, unlike formats, performs few unneeded operations.
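A minimal sketch of a V9 hash table lookup is given below, assuming the DATA step hash object syntax of SAS 9.1 or later. The data sets (small, very_big_file, controls), the variables and the sample values are hypothetical, and the hashexp value of 8 is an arbitrary choice rather than a tuning recommendation.

```sas
/* Hypothetical small file: one key and one variable to carry through */
data small;
  input pat_id treatment $;
  datalines;
1 test
2 control
3 test
4 control
;
run;

/* Hypothetical large file */
data very_big_file;
  input pat_id visit;
  datalines;
1 1
2 1
4 1
9 1
2 2
;
run;

data controls (drop=rc);
  /* Never executes; only puts the variables of small on the PDV */
  if 0 then set small;

  /* Load the small file into the hash object once, on the first pass */
  if _n_ = 1 then do;
    declare hash h(dataset: "small", hashexp: 8);
    h.defineKey("pat_id");
    h.defineData("treatment");
    h.defineDone();
  end;

  /* Read the large file top-to-bottom; one lookup per observation */
  set very_big_file;
  rc = h.find();     /* 0 means the key was found and treatment was filled */
  if rc = 0 and treatment = "control";
run;
```

Neither file is sorted or indexed here; the only preparation is loading the small file into memory, which is why the technique depends on the small file fitting in RAM.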

Figure 15: Key indexing, bitmapping & hashing. The small file (key plus one variable) is loaded into an array of keys held in RAM by the DATA step. The large file (Key, var1, var2, etc.) is then read, and the result file is written. As we process the large file, quick access to the values in the array lets us determine if we want the obs from the large file in the result file. Notes from the figure: no sorting; the small file is AUTOMATICALLY de-duped; high memory usage; FAST, if SAS array characteristics (max value in cell) do not limit the technique; you can code your own OR use SAS-coded hashing in SAS V9.

Figure 16: Hashing uses two searches. Left: a format file of subjects 001 to 032, each labeled test or control, searched for subject 7. Right: the same records stored in a hash object with eight top-level buckets and a tree structure below the buckets. Hashing divides the file into buckets; the hash function gets you to a bucket; the method searches down the tree; Find puts info on the PDV.

A comparison of format table lookup and V9 hashing lookup is shown in Figure 16. The left hand box illustrates how a format will search for subject 7. The right hand side illustrates a search for subject 7 in a hash object.

LOADING A V9 HASH OBJECT
It is common to think of a hash object as a smart array: an array that can grow as you add information to it. There is a top level of the hash object (buckets 1 to 8 in Figure 16) that is very similar to an array.
The programmer specifies the number of cells in the top level of the hash object, as s/he does in an array, and the number of cells in the top level is an important characteristic of the hash object.

The smaller file, which can be of any size, is loaded into the hash object. As observations from the smaller file are processed, the value of a key variable is used by a mathematical (hashing) function to calculate the proper top-level cell for that observation. If two observations hash to the same cell, a tree-like structure grows below the top-level cell and the data is stored in the tree under that top-level cell. In a normal SAS array, the original data would be overwritten and lost. Several different values of the key variable can hash to the same top-level cell, and all identical values of the key will hash to the same top-level cell. So, if the key for the hash object is state, all residents of California will be hashed to the same top-level cell. As can be seen, trees can be of different depths. Because hash objects can grow, any number of observations can fit in any hash object, regardless of the number of cells in the top level (sizing the top level is a topic for another paper). Cells store the value of the key and any values being carried through the table lookup.

While many variables can easily be loaded into a V9 hash object and carried through the table lookup, hash objects perform best when they can fit in RAM. If the hash object cannot fit in RAM, and the OS cannot page memory objects, the program might crash.

ACCESSING A V9 HASH OBJECT
After the small file has been loaded into the hash object, the large file is read top-to-bottom. A hash table lookup is performed for every observation in the large file using the following process. The value of the key variable (in V9, having multiple key variables is easy) is read from the PDV and passed to the hashing function. The hashing function returns the number of a top-level cell in the hash object.
If a matching observation (matched on the value of the key) has been loaded from the small file into the hash object, that value will be in the top-level cell or in the tree under it. SAS examines the top-level cell. If the top-level cell does not contain the proper value of the key variable(s), SAS searches down the tree using an efficient method. If the proper level of the key is found, information stored in the hash object is returned to the PDV. If the end of the tree is reached without finding the proper level of the key variable, SAS concludes that this level of the key was not in the small file and returns a failure to match (a number NE 0) to a variable on the PDV.

Resource demand for V9 Hashing Table Lookup: CPU: HIGH DISKSPACE: VERY LOW I/O: LOW

SQL
SQL is a powerful tool for table lookups, and its resource use is very difficult to classify. In SQL, the programmer specifies what s/he wants but not how to get it. Every time it runs, SQL invokes a powerful and complex subroutine called the SQL optimizer that examines system conditions (file sizes, buffer size, the existence of indexes on variables in the where clause, SQL syntax) and determines how the query should be run. The optimizer has a variety of techniques to use in executing code, and anything more than a cursory description is outside the scope of this paper.
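As a small illustration of a table lookup written as an SQL join, the sketch below joins two made-up tables. The table names, columns and values are assumptions for illustration. The _METHOD option on the PROC SQL statement is included because, to the best of my knowledge, it asks SAS to write a short description of the execution methods the optimizer chose to the log; the method codes it prints are not described in this paper and should be checked against SAS documentation.

```sas
/* Hypothetical small lookup table */
data small;
  input zip $ state $;
  datalines;
19003 PA
08540 NJ
;
run;

/* Hypothetical large table */
data big;
  input zip $ sales;
  datalines;
08540 250
19003 100
30301 75
19003 40
;
run;

/* The query says WHAT is wanted; the optimizer decides HOW to join */
proc sql _method;
  create table lookup as
    select b.zip, b.sales, s.state
      from big as b
           inner join small as s
           on b.zip = s.zip;
quit;
```

Rerunning the same query after the tables have grown, and comparing the log output, is one way to observe the optimizer changing its choice.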

The optimizer, in a merge, can automatically choose to employ one of the following: Cartesian products, a sorted-by-merge, an SQL version of a hash merge, an IORC merge or, finally, a step-loop merge. Surprisingly, the same query code, run repeatedly over the course of a project with data being refreshed and source files growing, can be executed in one manner for one refresh and in another manner after file sizes have changed.

The strength of SQL is its ease of use and the time it saves on the programming task. Complex operations can easily be coded in SQL. Conventional wisdom is that if space and run time are the problems forcing a programmer to explore efficiency, SQL will generally not be the solution. In published tests SQL has generally lost out in speed tests against Sorted By Merges, but the tests did not seem to consider, and did not document, the decisions made by the optimizer.

Much development work has been done to create SQL optimizer logic that automatically minimizes the space requirements of SQL, and generally the results are very good. Having said that, it seems that apparently small differences in syntax or environment can put the optimizer under a severe handicap and produce code that is good, but not maximally efficient. Within the last few years, proceedings papers have been published suggesting that SQL can be a very fast and small technique, if the query is written in a way that does not handicap the optimizer.

The optimizer is a subtle, brilliant and very useful subroutine and the subject of a series of papers whose titles start with the phrase The Optimizer Project. The author suggests these papers, which can be found online, as starting points for study of this topic.
Additionally recommended are papers written by Paul Kent or Lewis Church (both of SAS) as well as papers by the review committee to the SQL Optimizer Project (Paul Dorfman, Sigurd Hermansen, Kirk Lafler, and Paul Sherman).

Resource demand for SQL (currently): CPU: HIGH DISKSPACE: MED-HIGH I/O: MED-HIGH

Figure 17: Tools for table lookup. Our list of tools for table lookup, in the order listed in the figure (the figure's axis labels include "fast" and "Memory Intensive"):
    By Merges & SQL Joins:
        Tag Sorted by merge
        Indexed by merge
        SQL Joins
        SAS Sorted by merge
        Host System Sorted by merge
        Asserted to be sorted by merge
    Update
    _IORC_ MERGE as table lookup
    Formats as table lookup
    SAS V9 Hashing
    Custom Coded Hashing
    Custom Coded Bitmapping
    Custom Coded Key Indexing

CONCLUSION
Figure 17 presents a rough guide to the use of these techniques. It is hoped that the information provided in this paper will help point a reader towards the technique that best solves her/his efficiency problem.

REFERENCES
Several years of proceedings are available online and can be downloaded for free. A good place for searching NESUG and SUGI proceedings as well as SAS_L is

For general understanding of how things work in SAS:
    Aster & Seidelman, Professional Programming Secrets, McGraw Hill
    Virgile, Efficiency: Improving the Performance of your SAS Applications, SAS Institute
    SAS course notes 58032: Optimizing s

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
    Russell Lavery
    9 Station Ave. Apt 1, Ardmore, PA 19003, # 3
    russ.lavery@verizon.net
    Contractor for ASG, Inc.

SAS is a registered trademark of SAS Institute, Inc., in the USA and other countries. ® indicates US registration.
