An Animated Guide : Speed Merges: resource use by common procedures Russell Lavery, Contractor, Ardmore, PA


ABSTRACT

This paper compares how resources are used by different SAS table lookup techniques (Figure 1). It will be useful for programmers who are starting to study programming efficiency or who are experiencing efficiency problems with current programs. Understanding the resource use of basic processes gives programmers a logical structure for selecting the table lookup technique that will most likely solve their efficiency problem.

Advanced table lookup techniques all share one characteristic: they perform a Sorted By Merge (or something close to one) without sorting. The advanced table lookup techniques (format, _IORC_, hashing, SQL, tagsort, modify, etc.) make different demands on I/O, CPU and disk space. Understanding the demands of the different techniques lets a programmer select the best technique for his/her particular situation.

Several years of NESUG/SUGI proceedings papers are online. These papers can be downloaded and read for an in-depth discussion of each of the table lookup techniques mentioned here. This paper will help readers decide which of the table lookup techniques will best help them, and it offers a searchable database of the articles available online.

[Figure 1: Table lookup. This paper is concerned with FAST table lookup. Table lookup is our old friend the IF statement (If zip= then state= PA ; Else if zip= then state= NJ ; Else state=?? ;). Table lookup is often done by accessing another file (by merge, format, _IORC_, hashing, etc.). Table lookup is also subsetting. In the diagram, a DATA step (often a By Merge) combines a Small File (Key, var1, var2, etc.) with a Large File (Key, var1, var2, etc.) to produce some logical result.]

[Figure 2: Efficiency tradeoffs. Efficiency means using resources gently. Working for efficiency has many dimensions and involves trade-offs. Labels: takes too long; maintenance programming effort; initial programming effort; IO utilization; CPU; memory utilization; space; file is too big.]

INTRODUCTION

A table lookup is the use of a key, or id, variable in a small file to look up values in a large file. This often takes the form of taking a subset of the large file. The goal of this paper is to support good decision making when selecting a table lookup technique. The paper discusses the details of common SAS processes and focuses on the issues of CPU usage and disk space requirements.

Generally, people study efficiency because a job takes too long to run or write, or creates a file that is too big for their system. Efficiency is using the most constrained resource gently. Efficiency is a multidimensional concept and usually involves making trade-offs (Figure 2). Programs that use little RAM often have excessive IO. Programs that run quickly often have high CPU requirements. Knowing how processes use resources helps programmers make those trade-offs.

The processes discussed are:

Basic sequential SAS read of a data set
Sorting
   1) SAS sort
   2) Host sort
   3) Asserted Sort
   4) Ordered by
The binary search process
Formats
Indexes
   1) _IORC_
   2) tagsort
Key Indexing, Bitmapping and Hashing

Finally, the paper concludes with a graphic that ranks the different techniques and offers an Access database that can be used to identify useful online articles.

THE BASIC SAS READ

When SAS executes a set/merge statement on a file, it reads the file from top to bottom. This top to bottom read is a fast access technique for SAS. A top to bottom read is part of data step processing and is how many procs access data. Total run time is calculated by multiplying the time for one read by the number of observations. SAS performs a fast operation (a sequential read) but performs one for every observation, so total time can still be too high.

[Figure 3: SAS sequential read. Reading a SAS file sequentially (first obs to last) is a very fast process (for SAS). SAS performs a read for each observation, but each individual read is fast.]

   Data new;
      set old;
      tax=sales*.06;
   run;

SORTING AND THE SORTED BY MERGE

Sorting is a common SAS technique and a required step for the Sorted By Merge table lookup. A Sorted By Merge is easy to program but is very CPU and disk intensive (Figure 4).

When SAS sorts a file (say work.small) it creates a temp copy of the file AND a sorting copy that is about 3 ½ times the size of the original file. Sorting of observations is done in the temp file and, if successful, the sorted observations are written back to work.small. After work.small is overwritten, the temp file is automatically deleted and disk/memory space is freed up. It is the large size of the temp file (3 ½ times the size of the original file) that sometimes causes sorting to fill up a disk and fail. The sortsize=max and noequals options should be used on the sort whenever possible.

If data sets are small (several thousand obs), the Sorted By Merge is recommended because it is so easy to program and its hardware demands (CPU and disk) are easily met by modern equipment. With small files, the preliminary sorting and the top to bottom read in the data step take so little total run time that there is little incentive to explore advanced table lookup techniques.

With large files, programmers can experience time and disk space problems. In this case Sorted By Merges should be avoided. Sorting is a resource intensive operation, taking lots of time and lots of disk space. Sorting large files has crashed systems.

[Figure 4: SAS Sort. Sorting a big file creates a copy of the file and a sorting file that is approximately 2 1/2 times the size of the source file.]

   Proc sort data=small noequals sortsize=max;
      by state zip;
   run;
   Proc sort data=big noequals sortsize=max;
      by state zip;
   run;

Resource demand for Sorted By Merge: CPU: HIGH DISKSPACE: HIGH IO: HIGH

THE SAS ASSERTED SORT

The fastest way to sort data is to get the data delivered in sorted order and simply tell SAS to treat the data as sorted. This is the Asserted Sort. It takes zero time and is shown in Figure 5.
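The following is a minimal sketch of how an asserted sort can feed a Sorted By Merge used as a table lookup. The data set names (small, big, result), the variables and the sample values are all hypothetical; the only techniques taken from the paper are the sortedby= data set option and the by merge. The sketch assumes the rows really do arrive in zip order; if they do not, the by merge will stop with an out-of-order error.

```sas
/* Hypothetical small lookup file, delivered already in zip order */
data small(sortedby=zip);            /* assert the order, no PROC SORT */
   length zip $5 state $2;
   input zip state;
   datalines;
08002 NJ
19003 PA
19104 PA
;
run;

/* Hypothetical large file, also delivered in zip order */
data big(sortedby=zip);
   length zip $5;
   input zip sales;
   datalines;
08002 100
08002 250
19003 75
19999 40
;
run;

/* Sorted By Merge as table lookup: keep big obs that have a match in small */
data result;
   merge big(in=in_big) small(in=in_small);
   by zip;
   if in_big and in_small;
run;

proc print data=result;
run;
```

The in= flags are what turn the merge into a lookup: rows of big without a partner in small are dropped, and so are keys that occur only in small.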
Clients often struggle to sort data that is already sorted, or that can be requested already sorted from suppliers. The sortedby=variable option can be applied on the data or set statement. The Proc Contents box in Figure 5 shows what Proc Contents would report about data set one in Figure 5. SAS says the data is sorted but reports that SAS has not sorted the data. If the data had been sorted by a Proc Sort, the Validated characteristic would be YES. SAS will create, and use, the first. and last. variables on data that is asserted to be sorted.

[Figure 5: SAS Asserted Sort. If you know your data is sorted, you can assert that it is sorted and SAS will believe you. You can assert when you create or use the data set. Great time saver: no work done. Proc Contents shows: Sortedby: zip Validated: NO]

   data one /*(sortedby=zip)*/;
      infile datalines;
      input zip $char5.;
      datalines;
   ;
   data two;
      set one (sortedby=zip);
      by zip;
      if first.zip and last.zip;
   run;
   proc print data=two;
   run;

Resource demand for Asserted Sort: CPU: NONE DISK: NONE IO: NONE

HOST SORT VS SAS SORT

The word on the street is that the SAS Proc Sort is very good for small/moderate data sets but is a bit slow if the data set is large. Unix and some mainframes have sorting routines that can be faster than SAS Proc Sort. SAS can be instructed to use a custom/host operating system sorting routine rather than its own. A programmer can instruct SAS to use the SAS sort, a host sort, or the best sort method (host sort if the file is above a certain size). One danger of using a host sort is that it can use a different sorting sequence from SAS, so that files are sorted one way if they are smaller than the cutpoint (sorted by the SAS sort) and a different way if they are larger than the cutpoint (sorted by the host sort).

[Figure 6: Select SAS vs. Host Sort. Sometimes the operating system has a very fast sort. You can specify the SAS sort, the host OS sort, or the best sort. Conventional wisdom says the SAS sort is good for small files (test and set your own cutpoint). Watch out: there can be differences between SAS and host sort logic.]

   Options sortcp= xx ;
   Options sortpgm=sas;
   Options sortpgm=host;
   Options sortpgm=best;

SAS for MS Windows, before version 9, cannot support a host sort. It accepts the commands without issuing notes or warnings but does not do anything.

THE BINARY SEARCH PROCESS

The binary search process underlies several SAS procedures. It has been shown, mathematically, to be the optimal search method under conditions common to many SAS programming tasks. Binary searches are part of formats and indexes. Binary searches search an ordered file by repeatedly dividing it in half. Figure 7 shows the binary search process applied to a format. The format records are ordered in the catalog and SAS finds a format value by repeatedly applying simple logic/rules.

In Figure 7 we check whether subject 10 was assigned to the test or control treatment. SAS picks the middle observation in the file (9) and asks if that is the observation it was seeking. If it is the desired observation, the search stops. If it is not, SAS asks if the desired number is above, or below, the current number. It is above.

SAS then divides the file in half and only considers 9 to 17. It picks a number in the middle of the new range (14) and asks if that is the desired number. If it is, the process stops. It is not. SAS asks if the desired number is above, or below, the current number. It is below.

SAS then divides the range in half and only considers 9 to 14. It picks a number in the middle of the new range (11) and asks if that is the desired number. If it is, the process stops. It is not. SAS asks if the desired number is above, or below, the current number. It is below.

SAS then divides the range in half and only considers 9 to 10. It picks a number in the new range (10) and asks if that is the desired number. Since it is, the process stops.

[Figure 7: Formats use binary searches (searching a file by halves). SAS uses binary searches to look within formats and indexes. To look for subject 10: see if the subject number is in the exact middle of the file; if it is not, decide whether it is above or below the middle of the file and define a range; see if the subject number is in the exact middle of that range; repeat until success. The lookup statement is TorC= PUT(PAT_ID,InSml.);]

   Format File
   001 test       007 test       013 control
   002 control    008 control    014 test
   003 test       009 test       015 control
   004 control    010 control    016 test
   005 control    011 test       017 control
   006 control    012 test

Figure 7 illustrates a binary search using a format file and shows information being transferred from the format to the PDV via the put function.
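The halving logic described above can be sketched in a DATA step. This is only an illustration of the general binary search idea, not SAS's internal format code: the temporary arrays, the variable names (target, lo, hi, mid, probes) and the rounding rule for the midpoint are assumptions made for the sketch. The keys and labels are copied from the Figure 7 format file.

```sas
data _null_;
   /* Sorted keys and their labels, copied from the Figure 7 format file */
   array id{17} _temporary_ (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17);
   array trt{17} $7 _temporary_
      ('test' 'control' 'test' 'control' 'control' 'control'
       'test' 'control' 'test' 'control' 'test' 'test'
       'control' 'test' 'control' 'test' 'control');
   length result $7;

   target = 10;            /* subject we are looking for */
   lo = 1;
   hi = dim(id);
   found = 0;              /* position of the match, 0 while not found */
   probes = 0;             /* number of comparisons made */

   do while (lo <= hi and not found);
      mid = floor((lo + hi) / 2);               /* middle of the current range */
      probes + 1;
      if id{mid} = target then found = mid;     /* success: stop searching */
      else if id{mid} < target then lo = mid + 1;   /* target is above */
      else hi = mid - 1;                            /* target is below */
   end;

   if found then result = trt{found};
   else result = 'none';
   put target= result= probes=;
run;
```

With this midpoint rule the sketch probes subjects 9, 13, 11 and 10, so the exact path differs slightly from the walk-through in the text, but the number of probes grows only with the logarithm of the number of entries, which is the point of the technique.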
FORMATS

SAS formats automatically convert data values to the formatted value when data is displayed. Formats can convert one character to another, a character to a number, a number to a character or a number to a different number. Formats are created with a Proc Format. They take time to create and disk space after they are created. They can be re-used, so the cost of creation needs to be paid only once. There is no automatic maintenance performed on formats. Formats are very useful SAS tools to save time and save disk space.

SYNTAX (creates THREE formats)

   proc format;
      value skill
         LOW -< 1   = "BAD # LOW"
         1          = "SAS"
         2          = "Java"
         3,5,6      = "Microsoft"
         6 <- HIGH  = "BAD # HIGH";
      value $Gen_Age
         "M"         = "A.M."
         "F"         = "A.F."
         "C","B","A" = "Non-Adult"
         other       = "error";
      value $WHR_IN /* CHAR RANGES */
         low -< "00000"    = "bad zip"
         "19000"-"19099"   = "PHILA"
         "19100"-"19400"   = "pa"
         "80000"-"89999"   = "JERSEY"
         other             = "UNKNOWN";
   run;
   PROC PRINT DATA=EX_2;
      FORMAT GENDER $Gen_Age. JOB SKILL. ;
   RUN;

[Figure 8: Creating and using formats.]

   Data Set Ex_2              OUTPUT (prints using the format)
   NAME  GENDER  JOB          NAME  GENDER  JOB
   Bob   M       1            Bob   A.M.    SAS
   Russ  M       2            Russ  A.M.    Java
   Sue   F       2            Sue   A.F.    Java
   AJ    M       1            AJ    A.M.    SAS
   Dot   F       2            Dot   A.F.    Java

Figure 8 shows the normal creation, and typical use, of character and numeric formats. The large box on the left contains the syntax for a Proc Format and a Proc Print that applies those formats. A data set is in the upper right hand corner and the output of the Proc Print, applying the formats to that data set, is in the lower right.

A large portion of the speed of the format table lookup comes from the fact that it is usually a memory resident technique and avoids disk access. Theoretically there is no limit on the number of levels in a SAS format; however, when the format table lookup executes, the whole format must fit in RAM or performance suffers. Some operating systems cannot page formats and will crash if the format is larger than the available RAM.
Some operating systems will page large formats between disk and RAM, allowing the job to complete but increasing run time.

SAS code that would select, from a very large file, patients that had been assigned to be controls (imagine the file in Figure 7 is the format InSml) is shown below. This is a Format Table Lookup.

   Data subset;
      Set very_big_file;
      If put(pat_id,insml.)="control";
   Run;

Formats are stored in a catalog (permanent or work) and take RAM/disk space (Figure 9). When formats are created from an input file (see Figure 9), the data from the input file is summarized in a file (often called cntlin). Cntlin is used as input to a Proc Format. This cntlin file must be manually removed from work to release space.

Generally, formats determine how data is displayed. Formats can save disk space because the data in the SAS data set is stored as the original value (in Figure 8 the original values are 1, 2, M, F) and then displayed in a longer form. In Figure 8, the savings would be substantial if the zips were stored as zips (19101) and then expanded to much longer values like "Main Postal Distribution Center, Philadelphia, Pennsylvania".

[Figure 9: SAS Format. Formats require creation of a cntlin file and storage of the format in the catalog.]

   data CNTLIN (keep= fmtname start label type hlo);
      length label $5;   /* long enough for "other" as well as "YES" */
      retain FMTNAME "InSml" TYPE "n" LABEL "YES";
      set small (rename=(pat_id=start)) end=last;
      output;
      if last=1 then do;
         Hlo="O";
         label="other";
         start=.;
         output;
      end;
   run;
   proc format cntlin=cntlin;
   run;

   DATA N_SML_TOO;
      SET BIG;
      IF PUT(PAT_ID,InSml.)="YES";
   RUN;
   PROC PRINT;
   RUN;
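As a rough end-to-end sketch of the format table lookup, the steps above (build cntlin, create the format, subset the large file, then release the cntlin space) can be strung together as follows. The data sets small and big, the variable visit_cost and all sample values are hypothetical and exist only to make the sketch runnable; the format name InSml follows the paper.

```sas
/* Hypothetical small file: the patient ids we want to keep */
data small;
   input pat_id;
   datalines;
2
5
9
;
run;

/* Hypothetical large file */
data big;
   do pat_id = 1 to 10;
      visit_cost = pat_id * 10;
      output;
   end;
run;

/* Control data set: one row per key, plus a catch-all OTHER row */
data cntlin(keep=fmtname type start label hlo);
   length label $5 hlo $1;
   retain fmtname 'InSml' type 'n';
   set small(rename=(pat_id=start)) end=last;
   label = 'YES';
   hlo = ' ';
   output;
   if last then do;
      hlo = 'O';            /* O marks the OTHER range */
      label = 'OTHER';
      start = .;
      output;
   end;
run;

proc format cntlin=cntlin;
run;

/* Format table lookup: keep big obs whose key appears in small */
data subset;
   set big;
   if put(pat_id, InSml.) = 'YES';
run;

/* The format now lives in the catalog, so the control data set can go */
proc datasets lib=work nolist;
   delete cntlin;
quit;
```

Deleting cntlin does not affect the stored format; only the work copy of the control data is released.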

While formats are often used for table lookup, they were not designed with this in mind. As a result, formats perform extra processing that is not required for the simple task of table lookup. When the Format Table Lookup fails, a programmer often tries an _IORC_ Merge. An _IORC_ Merge is based on a SAS index and is not a RAM resident technique. It is generally slower than a Format Table Lookup, but often faster than a Sorted By Merge.

Resource demand for Format Table Lookup: CPU: HIGH DISKSPACE: Low IO: Moderate

INDEXES: THE BASIS OF SEVERAL MERGE TECHNIQUES

Indexes can be created by several procs. They take time to create and disk space after they are created. They can be reused, so the cost of creation needs to be paid only once.

[Figure 10: SAS Index. The diagram shows a Small File with its index, and a Big File with its index and a copy of the Big File. Indexes are from 10% to 50% of the size of the source file. Indexes take time to create! Individual reads are slow! Indexes are good 1) if they can be reused and 2) if you only want 5% to 10% of the big file.]

   data small(index=(state));
      set small;
   run;
   Proc sql;
      create index state on work.small(state);
   quit;
   Proc datasets lib=work;
      Modify Large;
      index create zip;
   quit;

[Figure 11: Index lookup involves more operations than format lookup. Reminder: indexes use binary searches. The index file holds subject, page and row. Once the subject is found in the index, accessing the information is a multi-step process: the index returns the page/block location (page and row) of the data; the controller moves the read head to find that page on the spinning disk drive; a page of data is read back to the CPU; the data is parsed to find the proper observation and field.]

Generally, indexes take more time to recover data than formats because the use of an index involves more steps than the use of a format, and some of the steps are slow. As a first step in data access, an index goes through the same binary search as a format.
However, a successful find does not return the desired information (see Figure 7) but rather a location on the hard drive where the information can be found (see Figure 11). Reading the information involves a slow (mechanical) disk read to find a page of data that contains several observations, and then CPU cycles to select the correct observation and variable. Any mechanical process is slow and to be avoided.

Indexes are attractive if they return a small fraction of a file. If an index returns 20% of the large file, there are often faster techniques. If an index returns 75% of a file, the process will be slow.

THE _IORC_ MERGE

The _IORC_ merge requires the use of an index and is illustrated in Figure 12. The first set executes and loads the variables from that data set (Day_1) into the PDV. The second set (with the key= option) uses the value of the key from the PDV to do an indexed read of the data set UpDt. In Figure 12 the lookup is successful and variables from UpDt are copied to the PDV. A successful index lookup results in a value of 0 being written to the variable _IORC_ on the PDV.

[Figure 12: A successful _IORC_ lookup. Data set UpDt holds Name and Run (Bob Y, Russ N, Sue N, AJ N, Fred Y, Glenn N, KL Y). Data set Day_1 holds Name, Sh_Sp and T_Elb for Bob, Eric, Sue, Fred, Mark, Walt, KL, Wayne and Sally. The data vector holds Name, Run, Sh_Sp, T_Elb, _N_, _ERROR_ and _IORC_; for Bob it holds Bob Y S N, and the output file receives Bob Y S N.]

   data new3;
      set Day_1;
      set UpDt key=name/unique;
      array setmiss(*) $ ShinSpl -- T_Elb;
      if _iorc_ then do;
         _error_=0;
         _IORC_=0;
         do i= 1 to dim(setmiss);
            setmiss(i)="";
         end;
      end;
   run;

An unsuccessful attempt to do a table lookup is shown in Figure 13. When the second set statement fails to find a match in the index lookup, it writes a non-zero value to _IORC_. The non-zero value of _IORC_ causes the do group to execute and reset variables in the PDV.
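A small self-contained sketch of the successful and unsuccessful cases described above follows. The data set names echo the paper's figures, but the variables (ran, sh_sp) and the sample rows are hypothetical, and resetting only one variable is a simplification of the array loop shown in the figures.

```sas
/* Hypothetical lookup file with a unique index on name */
data updt(index=(name / unique));
   length name $8 ran $1;
   input name ran;
   datalines;
Bob Y
Russ N
Sue N
Fred Y
;
run;

/* Hypothetical driving file: every row triggers one indexed read */
data day_1;
   length name $8 sh_sp $1;
   input name sh_sp;
   datalines;
Bob S
Eric S
Sue N
;
run;

data new3;
   set day_1;                      /* loads name and sh_sp into the PDV */
   set updt key=name / unique;     /* indexed read using the PDV value of name */
   if _iorc_ ne 0 then do;         /* non-zero: no match in the index */
      _error_ = 0;                 /* suppress the error note and PDV dump */
      ran = ' ';                   /* clear the value left by the last match */
   end;
run;

proc print data=new3;
run;
```

Eric is not in updt, so without the reset his row would silently carry Bob's value of ran, because variables read with a set statement are retained across iterations.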
[Figure 13: An unsuccessful _IORC_ lookup, using the same data sets and DATA step code as Figure 12.]

Every observation in the data set named in the first set statement triggers an index lookup. This technique is good if the number of observations in the first file is less than 10% of the number of observations in the second file, and poor at 30% or more.

Resource demand for _IORC_ Table Lookup: CPU: HIGH DISKSPACE: Low IO: HIGH

THE TAGSORT

A tagsort (Figure 14) involves the creation of a secondary file containing the key and a pointer to the location of the observation on the disk (like an index). Tagsorts are easy to program and can be used to support a Sorted By Merge without sorting a file. The tagsort process is similar to an automated _IORC_ merge in that the basic process is an index lookup. It has all the problems associated with index usage if the small file has more than 5% of the number of obs in the large file.

[Figure 14: SAS Tagsort. Tagsorts create a sorted secondary file containing the sort keys (state, zip) and a pointer, plus a sort file for the secondary file. Using a tagsort involves searching in the sorted secondary file and an indexed lookup using the pointer. Proc Contents shows: Sortedby: zip Validated: YES]

   proc sort data=big tagsort;
      By state zip;
   run;

Resource demand for Tagsort Table Lookup: CPU: HIGH DISKSPACE: Low IO: HIGH

KEY INDEXING, BITMAPPING AND MANUAL HASHING

Key indexing, bitmapping and hashing have been described in a series of articles written by Dr. Paul Dorfman. These are very fast table lookup techniques because they load the whole small data set into standard SAS arrays, which must exist in RAM. These techniques can be difficult to code and are generally a challenge for the programmer who inherits the hashing code. In key indexing, bitmapping and manual hashing, a mathematical function reads the key information from the PDV and calculates the proper bucket in the array in one fast step.

Resource demand for Key Indexing, etc. Table Lookup: CPU: HIGH DISKSPACE: VERY LOW IO: LOW

[Figure 15: Key indexing, bitmapping & hashing. A DATA step loads the Small File (key, one var. max) into an array of keys in RAM, then reads the Large File (Key, var1, var2, etc.) and writes the Result File. No sorting. The Small File is AUTOMATICALLY de-duped. High memory usage. FAST, if.. SAS array characteristics (max value in cell) will limit the technique.]
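Key indexing, the simplest of the three techniques, can be sketched as below. This is an illustration under stated assumptions rather than Dr. Dorfman's published code: the keys are assumed to be positive integers no larger than 1000, and the data sets (small, big, result), the variable sales and the sample values are hypothetical.

```sas
/* Hypothetical small file; note the duplicate key 5 */
data small;
   input pat_id;
   datalines;
2
5
9
5
;
run;

/* Hypothetical large file, in no particular order requirement */
data big;
   do pat_id = 1 to 10;
      sales = pat_id * 100;
      output;
   end;
run;

data result(drop=k);
   /* One array cell per possible key value; the array lives in RAM */
   array seen{1:1000} _temporary_;

   /* Load the small file once: the key itself is the array subscript */
   do until (eof_small);
      set small(rename=(pat_id=k)) end=eof_small;
      if 1 <= k <= 1000 then seen{k} = 1;    /* duplicates collapse into one cell */
   end;

   /* Read the large file and keep obs whose key cell is set */
   do until (eof_big);
      set big end=eof_big;
      if 1 <= pat_id <= 1000 then
         if seen{pat_id} then output;
   end;
   stop;
run;

proc print data=result;
run;
```

The range check is nested ahead of the array reference on purpose, so an out-of-range key never reaches the subscript. The upper bound of the array is the practical limit of the technique: a key such as a nine-digit id would need an array too large for memory, which is where bitmapping and hashing come in.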
[Figure 15, continued: You can code your own hashing, or use the SAS-coded hashing in V9. As the Large File is processed, quick access to the values in the array lets us determine whether we want the obs from the Large File in the result file.]

V9 HASHING

In response to the challenge of coding hashing algorithms, SAS has created an easy-to-use hashing applet in V9. The applet can be called from within the data step and creates something like a very smart array.

[Figure 16: Hashing uses two searches. A format holds its entries (001 test, 002 control, ...) in one ordered file. Hashing divides the file into buckets, with a tree structure below the buckets. To look for subject 7: the hash function gets you directly to a bucket; the method then searches down the tree; find puts matches on the PDV; if there is no match, RC is not zero.]

Hashing is production in V9 and preliminary speed tests show it to be a fast technique. To get maximum speed from the technique, the whole small file should be memory-resident. On some platforms, SAS may crash if the hash object does not fit in RAM.

V9 hashing has a new file structure and searches it using a two part algorithm. The file structure/algorithm is one reason that hashing is generally faster than formats for table lookup. An additional factor is that the hashing algorithm was designed to do table lookup and, unlike formats, performs few unneeded operations.

Resource demand for V9 Hashing Table Lookup: CPU: HIGH DISKSPACE: VERY LOW IO: LOW

SQL

SQL is a powerful tool for table lookups. The strength of SQL is its ease of use and the time it saves on the programming task. Complex operations can easily be coded in SQL. Generally, SQL has lost out in speed tests against Sorted By Merges. The basic SQL process creates a Cartesian product that can be very large. Much development work has been done to minimize the space requirements of SQL. Even so, if space and run time are the problems forcing a programmer to explore efficiency, SQL will generally not be the solution.

Resource demand for SQL Table Lookup: CPU: HIGH DISKSPACE: MED-HIGH IO: MED-HIGH

CONCLUSION

Figure 17 presents a rough guide to the use of these techniques. It is hoped that the information provided here will point a reader towards the technique that best solves her/his efficiency problem. For cheap and quick access to additional information on all of these topics, request a copy of the searchable database of online proceedings papers from the author. The database can be used to identify articles on the subject of interest that can then be downloaded from the SUGI and NESUG web sites.
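To make the V9 hashing technique discussed above concrete, here is a minimal sketch of a table lookup with the DATA step hash object. The data sets (small, big, result), the variables and the sample values are hypothetical; the object name lookup is arbitrary. The large file is deliberately left unsorted, since neither file needs to be in key order.

```sas
/* Hypothetical small lookup file */
data small;
   length zip $5 state $2;
   input zip state;
   datalines;
08002 NJ
19003 PA
19104 PA
;
run;

/* Hypothetical large file, deliberately NOT in zip order */
data big;
   length zip $5;
   input zip sales;
   datalines;
19003 75
08002 100
19999 40
08002 250
;
run;

data result(drop=rc);
   if _n_ = 1 then do;
      if 0 then set small;                  /* puts zip and state on the PDV, reads nothing */
      declare hash lookup(dataset: 'small'); /* load the small file into memory */
      rc = lookup.defineKey('zip');
      rc = lookup.defineData('state');
      rc = lookup.defineDone();
   end;
   set big;
   rc = lookup.find();     /* 0 when the key is found; state is copied to the PDV */
   if rc = 0;              /* keep only obs with a match */
run;

proc print data=result;
run;
```

The hash table is built once, on the first iteration, and every later observation from big costs one in-memory find, which matches the paper's description of a memory-resident technique with low IO.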

[Figure 17: Our list of tools for table lookup. Axis labels recoverable from the graphic: fast; intensive; memory intensive.]

   TOOLS FOR TABLE LOOKUP
   By Merges & SQL Joins
      Tag Sorted by merge
      Indexed by merge
      SQL Joins
      SAS Sorted by merge
      Host System Sorted by merge
      Asserted to be sorted by merge
   Update
   _IORC_ MERGE as table lookup
   Formats as table lookup
   SAS V9 Hashing
   Custom Coded Hashing
   Custom Coded Bitmapping
   Custom Coded Key Indexing

REFERENCES

Several years of proceedings are available online for free. A MS Access database listing online SUGI/NESUG articles is available from the author. It can be searched for articles on formats, _IORC_, etc. that can then be downloaded from the web. Contact the author for a copy.

For general understanding of how things work in SAS:

Aster & Seidelman, Professional Programming Secrets, McGraw Hill
Virgile, Efficiency: Improving the Performance of your SAS Applications, SAS Institute
SAS course notes 58032: Optimizing s

CONTACT INFORMATION

Comments and questions are valued and encouraged. Contact the author at:

Russell Lavery, Independent Contractor
9 Station Ave. Apt 1
Ardmore, PA
russell.lavery@verizon.net

SAS is a registered trademark of SAS Institute, Inc., in the USA and other countries. ® indicates USA registration.
