I/O Performance Improvements in Release 6.07 of the SAS System under MVS, CMS, and VMS

Stephen M. Beatrous, SAS Institute Inc., Cary, NC
John T. Stokes, SAS Institute Inc., Austin, TX

INTRODUCTION

The goals of Release 6.06 of the SAS® System were to incorporate new functionality and to add an entirely new look and feel to the SAS System while continuing to support most features from Version 5. Release 6.06 was indeed more flexible and powerful than Version 5, but it was not as fast.

The goal of Release 6.07 of the SAS System was improved performance. Release 6.07 refines the powerful system introduced in Release 6.06 so that it is as fast as or faster than Version 5; that is, Release 6.07 provides all the power and flexibility of Release 6.06 without sacrificing performance. In short, with Release 6.07, SAS Institute is giving you something (advanced features) for nothing (performance comparable to Version 5).

Many aspects of the SAS System have been optimized to improve its overall performance in Release 6.07. This paper describes the most significant of the I/O enhancements for Release 6.07. The data presented in this paper were collected under MVS. The conclusions have been verified on the SAS System under CMS and VMS.

SEQUENTIAL READS AND WRITES

Most SAS applications sequentially process the observations within a SAS data file. Activities such as merging files, producing reports, analyzing data, and generating graphs involve sequentially reading and writing observations in SAS data files. A major goal for Release 6.07 development was to improve the CPU performance of sequential I/O. Release 6.07 uses the following two techniques to optimize sequential access to SAS files:

   - streamlining the code path for sequential access
   - setting default I/O buffer sizes to favor sequential processing.

[Figure 2: Comparison of Sequential Writes. CPU time in seconds by observation length, by release.]
Figure 1 summarizes the impact of the streamlined code path and the new default buffer sizes on the CPU performance of sequential reads. Figure 2 summarizes sequential writes.

[Figure 1: Comparison of Sequential Reads. CPU time in seconds by observation length, by release.]

Figure 1 and Figure 2 show the CPU time required to sequentially process 100,000 observations in Releases 5.18, 6.06, and 6.07. For all releases, the CPU time required to process a file increases as the observation length increases. The slope of the line describing the increase, however, is greatest in Release 6.06. The slopes for Releases 5.18 and 6.07 are roughly equivalent. Release 6.07 consistently uses less CPU time than Release [...].

Streamlining Code

In both Release 6.06 and Release 6.07, a SAS procedure specifies what type of access it requires when it opens a SAS data file. In Release 6.07, when a procedure opens a file for sequential access, the SAS System uses a streamlined set of subroutines to process that file. These subroutines optimize individual reads and writes by

   - avoiding movement of data out of I/O buffers
   - bypassing unnecessary checks (for example, if you don't use the OBS= option, the SAS System no longer checks to see if each read has passed the OBS= option limit)
   - reducing the layers of code that an observation must pass through.

These streamlined techniques can greatly reduce the amount of CPU time required to process a SAS data file.

Setting Default Buffer Sizes to Favor Sequential Processing

Release 6.06 chose buffer (also called page) sizes to minimize the amount of wasted space within a page and to keep page sizes as small as possible. (Wasted space is free space at the end of a page that is not sufficient to hold another observation.) The Release 6.06 algorithm attempts to find the smallest page size that wastes no more than five percent of its total space. This algorithm conserves disk space and memory at the expense of CPU performance.
Release 6.07 chooses default buffer sizes to minimize the consumption of CPU time during sequential processing. On all three platforms, increasing the number of observations per page decreases the amount of CPU time required to process a file sequentially. The CPU performance improvements are dramatic up to a point, but then they taper off. The optimal points for the MVS, CMS, and VMS operating systems are independent of the size of the observations.


To understand how the new page-size algorithm works, consider the following data gathered under MVS. Optimal CPU performance was achieved at 80 observations per page. Thus, the optimal page size for a file with observations 100 bytes long is 8000 bytes. The optimal page size for a file with observations 50 bytes long is 4000 bytes. Of course, the SAS System must round these optimal page sizes up to accommodate operating system constraints such as block size and a small amount of SAS System overhead.

Larger page sizes can have negative consequences when memory is scarce because the larger default page size may mean you will not have enough memory to read a SAS data file in all of your applications. If memory is a more valuable resource for your application than CPU time, you may want to use the BUFSIZE= option to specify a particular page size.

Larger page sizes can also have negative consequences for an application that accesses data in a random pattern. One example of such an application is the use of the POINT= option in the DATA step's SET statement. Access will be random when the value of the variable specified in the POINT= option varies by a large amount from one execution of the SET statement to the next. Another example of such an application is the use of varying observation numbers on the command line of the FSEDIT procedure to update observations in a random pattern. The negative effects of large page sizes on random access applications are exaggerated in a SAS server (a SAS session executing the SERVER procedure of SAS/SHARE® software) by the large number of opens being processed concurrently.

It is difficult to determine an optimum page size for an application in which data is accessed randomly. You must estimate how many observations on each file page are likely to be used while each page is in memory.
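The two options mentioned above, BUFSIZE= and POINT=, can be sketched together as follows. This is a minimal illustration rather than a tuning recommendation: the data set names (WORK.ORDERS, WORK.SAMPLE), the variable names, the seed, and the page size of 4096 are all assumptions chosen for the example.

```sas
/* Hypothetical names: WORK.ORDERS, WORK.SAMPLE, ID, AMOUNT, OBSNUM.   */
/* BUFSIZE=4096 is an illustrative value only; the SAS System may      */
/* round it up to satisfy operating system constraints.                */
data work.orders(bufsize=4096);
   do id = 1 to 10000;
      amount = id * 1.5;
      output;
   end;
run;

/* Random access: OBSNUM varies widely from one execution of SET to    */
/* the next, so consecutive reads tend to land on different pages.     */
data work.sample;
   do i = 1 to 100;
      obsnum = ceil(ranuni(12345) * total);    /* 1 through TOTAL      */
      set work.orders point=obsnum nobs=total;
      output;
   end;
   stop;      /* POINT= never reaches end of file, so stop explicitly  */
   drop i;
run;
```

With a pattern like this, only one observation per page read is likely to be used, which is the situation in which a smaller page size reduces wasted I/O.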
If your application uses (or can be programmed to use) clusters of observations, you may be able to select a page size that groups all of the observations in a cluster on the same file page. On the other hand, if your application accesses data in no predictable pattern, a smaller page size will minimize the amount of I/O and CPU time wasted by reading the unused observations on each file page.

SAS servers fall into the last category because the access pattern of an online data-entry or data-update application can be impossible to predict. The amount of I/O time spent reading wasted observations can be significant in a SAS server's execution, so you should be especially sensitive to the page sizes of files accessed through SAS servers by your applications.

NEW FILE FORMATS

In Release 6.06 the size of a SAS data file was greater than the size of a similar file in Version 5 because the Release 6.06 file format required 12 bytes of overhead for each observation. For SAS data files with a small record length (that is, with few variables), this 12-byte overhead could be significant. File compression was introduced as a way of minimizing the impact of larger file sizes on disk usage, but many sites were unable to absorb the extra CPU cost of compressing and decompressing the observations within a SAS data file.

Because of customer concerns over both the increased size of noncompressed files and the CPU cost of reading compressed files, Release 6.07 introduces new file formats for both noncompressed and compressed files.

Lean File Format for Noncompressed Files

To achieve more efficient I/O processing by decreasing file sizes, Release 6.07 introduces a lean format for noncompressed files. The lean file format reduces the overhead associated with each observation from 12 bytes to a single bit. The bit associated with each observation in the lean format flags deleted observations.
The Release 6.07 lean file format is in all cases an improvement over the format in Release 6.06. As Table 1 shows, under MVS and CMS, the new format is also an improvement over the Version 5 format.

Table 1  Overhead per Observation by Release

   Release of the SAS System   Overhead per Observation
   5.18                        4 bytes (MVS)
                               4 bytes (CMS)
                               0 bytes (VMS)
   6.06*                       12 bytes
   6.07*                       1 bit

   * All three operating systems

The percentage of improvement in file size from Release 6.06 to Release 6.07 varies depending on the size of the observations and on the number of observations in the file. As Table 2 illustrates, the decreased file size is most significant for files with small observations.

Table 2  Effect of Observation Length on Lean File Format

                        Number of Pages*
   Observation Size   Release 6.06   Release 6.07   Percent Improvement

   * The data sets each contained the same number of observations, and the page size was held constant at 6144 to aid comparison.

All features available with the Release 6.06 file format are available with the Release 6.07 lean format. Release 6.07 can create both formats. A site that is sharing data between Releases 6.06 and 6.07 will want to use the 6.06 format, but all other sites will want the enhanced 6.07 format. For information on specifying the file format you want, refer to the section Specifying a File Format, later in this paper.

Note that the lean format does not apply to compressed files because the 12-byte overhead per observation is needed to manage compressed data. Because of this 12-byte overhead, a compressed version of a file can be larger than a noncompressed version of the same file. Compression must save an average of 12 or more bytes per observation for the compressed file to be smaller than a noncompressed lean file. When you create a compressed file, Release 6.07 prints a note to the log that tells you how much you saved (or lost) by compressing the SAS data file.
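The effect of the per-observation overhead on page counts can be approximated with a short DATA step. This is a rough sketch only: it ignores page headers and operating system block rounding, the observation lengths are hypothetical, and the results are not the values from Table 2. The variable names are assumptions made for the example.

```sas
/* Rough estimate of observations per page under each format.          */
/* Ignores page headers and block rounding; OBSLEN values are          */
/* hypothetical and are not taken from the paper's tables.             */
data _null_;
   pagesize = 6144;                  /* page size quoted for Table 2   */
   do obslen = 8, 40, 200;
      /* Release 6.06 format: 12 bytes of overhead per observation     */
      pp606 = floor(pagesize / (obslen + 12));
      /* Release 6.07 lean format: 1 bit of overhead per observation   */
      pp607 = floor((pagesize * 8) / (obslen * 8 + 1));
      put obslen= pp606= pp607=;
   end;
run;
```

The same arithmetic explains the break-even point for compression: a compressed file still carries 12 bytes per observation, so it is smaller than the lean file only when compression removes at least that much on average.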
Faster Compressed Files

Many users of Release 6.06 were happy with the amount of disk space saved when they compressed their files. Some users, however, were unable to use compression because of the CPU cost involved in decompressing the file every time it was read. Release 6.07 introduces a new compressed file format that decompresses faster than the Release 6.06 format.

Both compressed formats replace repeated bytes within an observation buffer with a repetition factor and a single occurrence of the byte. The old format prefixes all compressed fields with an escape character. The new format prefixes both noncompressed and compressed fields with a length. The two methods do not differ in the amount of time required to compress a file; however, the difference in the amount of time required to decompress a file can be substantial. The old algorithm must search all uncompressed data for the escape character that prefixes a compressed field. The new algorithm does not need to scan for an escape character because all fields begin with a length specification. The improvement you see reading compressed files will vary depending on how many uncompressible fields an observation buffer contains and on the length of these uncompressible fields.

To measure the effects of the new compressed file format, consider the four SAS data sets described in Table 3. Each of these data sets contains 50,000 80-byte observations. The differences are in the contents of the observations.

Table 3  Four Different Compressed Data Sets

   Data Set Name   Contents of Data Set
   ALL             Contains almost all compressible data, including one 72-byte blank character variable. This data set also contains an 8-byte noncompressible variable.
   AVERAGE         Contains a mixture of compressible and noncompressible data, including a 20-byte compressible field, a 10-byte noncompressible field, a 10-byte compressible field, a 32-byte noncompressible field, and a numeric variable with values ranging from 1 to 50,000.
   MISSING         Contains all missing values, including ten numeric variables with all missing values.
   NONE            Contains almost all noncompressible data, including one 72-byte noncompressible variable. This data set also contains an 8-byte compressible variable.

Table 4 shows the amount of CPU time required to decompress each of these data sets from the Release 6.06 and Release 6.07 compressed formats.
Table 4  Comparison of CPU Usage while Decompressing Files

   Data Set Name   Release 6.06 Format   Release 6.07 Format   Percent Improvement
   ALL
   AVERAGE
   MISSING
   NONE

The SAS data sets MISSING, ALL, and NONE illustrate extreme cases of decompression while the data set AVERAGE is an average case. In every case, decompression of the new compressed file format outperforms decompression of the Release 6.06 compressed file format. The performance improvement is most dramatic in the case with no compressible data, and it is not significant when the entire observation is compressible.

All features available with the Release 6.06 compressed file format are available with the Release 6.07 compressed format. Release 6.07 can create both formats. A site that is sharing data between Releases 6.06 and 6.07 will want to use the 6.06 format, but all other sites will want the enhanced 6.07 format. For information on specifying the file format you want, refer to the next section, Specifying a File Format.

Specifying a File Format

Release 6.07 produces new formats of compressed and noncompressed files. Although it uses the new formats by default, Release 6.07 can transparently read Release 6.06 formats. If you need to share SAS data files between Releases 6.06 and 6.07, you must force Release 6.07 to create Release 6.06 file formats.

Release 6.07 provides several ways of specifying the file format you want. The different methods offer you varying degrees of control. For example, your SAS Site Representative can set the default engine for the entire site, but you can specify a different default engine for your own SAS session. Table 5 shows the five ways you can specify a file format.
Table 5  Specifying a File Format

   Option or Argument   Location                  Data Sets for Which Default Engine Is Set
   ENGINE=              site configuration file   all data sets created at site
   ENGINE=              SAS invocation            all data sets created during that SAS session
   V606 | V607          LIBNAME statement         all data sets in the specified library
   FILEFMT=             LIBNAME statement         all data sets in the specified library
   FILEFMT=             data set option           the one data set being opened

For example, if a site representative wants to set the default file format for the entire site to the Release 6.06 format, he or she can add the following option to the site configuration file:

   engine=v606

Note that the format of the configuration file is system-dependent. See the SAS documentation for your operating system for details.

A user who wants to set the default for a single SAS session can do so when invoking the SAS System. For instance, under MVS, the user starts the SAS System with the following command:

   sas options(engine=v606)

Now, if the same user wants to set the default for a library to the Release 6.06 format, he or she can do so with either of the following LIBNAME statements:

   libname perm 'external-file-name' filefmt=606;
   libname perm v606 'external-file-name';

Finally, a user who wants to create an individual file in the Release 6.06 format can use a SAS data set option:

   data perm.a (filefmt=606);

NOTE: In Release 6.07, the FILEFMT= and ENGINE= options, along with the name of an engine in the LIBNAME statement, control the format used for new files. These options are useful only for sites that need to read and write the same SAS data sets from both Releases 6.06 and 6.07. No option is necessary for Release 6.07 to read and modify Release 6.06 data sets.

IMPROVED MEMORY USAGE IN THE DATA STEP

In Release 6.06, a DATA step that read several files required enough memory to hold the variable descriptor information for all of the files being read. Release 6.07 requires only enough memory to hold descriptors for the file with the most variables. Consequently, DATA steps that ran out of memory reading lots of files in Release 6.06 should run to completion in Release 6.07. Figure 3 compares the amount of memory required to execute the following DATA step. This DATA step reads four SAS data sets with 1,000 variables each in Releases 5.18, 6.06, and 6.07.

   data _null_;
      set ;

[Figure 3: Comparison of Memory Requirements. Memory in kilobytes by release; the bar labels that survive are 1494K, 365K, and 176K.]

INDEXING PERFORMANCE

Release 6.06 introduced indexes as a tool for tuning performance. In some applications, indexes have made a dramatic improvement. For details on the effects of using indexes, refer to "Effective Use of Indexes in the SAS System" in the Sixteenth Annual SAS Users Group International Conference Proceedings.

In the interest of making a good thing better, Release 6.07 makes the following improvements to indexing:

   - Indexes created by Release 6.07 are 20%-30% smaller than they were in Release 6.06.
   - Creating an index in Release 6.07 takes half the CPU time and half the I/O time compared to Release 6.06.
   - The algorithm for choosing an index for WHERE-clause optimization has been improved to take into account the BUFNO= option and file compression.
   - Additional types of WHERE queries are optimized:

     Query                  Example
     SUBSTR functions       where substr(lname,1,3)='Smi';
     CONTAINS operations    where lname contains ('Smi');
     LIKE operations        where lname like ('%Rob_%');
     Truncated operators    where lname gt: 'Sm';

WHERE-CLAUSE PERFORMANCE

Version 6 of the SAS System introduced the WHERE clause as a general method of subsetting a SAS data set.
The WHERE clause is similar to the DATA step's subsetting IF statement, but it has several advantages over subsetting IF:

   - You can use the WHERE clause outside of the DATA step
        - with procedures
        - on the PROC FSEDIT and PROC FSVIEW command lines
        - in SCL programs
        - as a data set option.
   - The WHERE clause can be optimized with an index.
   - The WHERE clause allows two more operators: LIKE and CONTAINS.

WHERE clauses without index optimization are not as fast as a subsetting IF statement in Release 6.06. However, index-optimized WHERE clauses are generally much faster than subsetting IF statements. In Release 6.07, the unoptimized WHERE-clause performance matches that of the subsetting IF statement for most cases.

Consider an example SAS data set with 500,000 observations and 1 variable:

   data a;
      do x = 1 to 500000;
         output;
      end;
   run;

Now consider a simple and a complex query on this file:

Example 1: Simple Query

   data _null_;                 data _null_;
      set a;                       set a;
      if x = 1;                    where x = 1;
   run;                         run;

Example 2: Complex Query

   data _null_;                 data _null_;
      set a;                       set a;
      if x = 1  or                 where x = 1  or
         x = 3  or                       x = 3  or
         x = 5  or                       x = 5  or
         x = 7  or                       x = 7  or
         x = 9  or                       x = 9  or
         x = 11 or                       x = 11 or
         x = 13 or                       x = 13 or
         x = 15;                         x = 15;
   run;                         run;

Table 6 compares these simple and complex queries as subsetting IF statements and as WHERE clauses in Release 6.06 and Release 6.07.

Table 6  Comparison of CPU Usage by the Subsetting IF Statement and the WHERE Clause

                   Release 6.06                 Release 6.07
   Query Type   IF   WHERE   INDEX WHERE     IF   WHERE   INDEX WHERE
   simple
   complex

For a simple query, the WHERE clause in Release 6.07 is more efficient than the subsetting IF statement. The complex query shows a lot of improvement between releases, but the WHERE clause is still slower than the subsetting IF statement for the complex query. Note that both the simple and the complex WHERE queries can be index optimized. An index-optimized WHERE clause will outperform a subsetting IF in all cases.

In general, the WHERE clause is the recommended method for performing queries with the SAS System.
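As a sketch of the index-optimized case mentioned above, the example data set can be given a simple index on X before the WHERE query runs. The use of the WORK library here is an assumption made for illustration, and whether the SAS System actually selects the index for a given query is decided by its own optimization algorithm, not by this code.

```sas
/* Build the example data set with 500,000 observations.               */
data a;
   do x = 1 to 500000;
      output;
   end;
run;

/* Create a simple index on X. WORK is assumed as the library.         */
proc datasets library=work nolist;
   modify a;
   index create x;
run;
quit;

/* The WHERE clause is now a candidate for index optimization.         */
data _null_;
   set a;
   where x = 1;
run;
```

A subsetting IF statement in the same step would still read every observation, because only the WHERE clause can be optimized with an index.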
For complex queries on large SAS data sets where CPU performance is critical, you may want to compare the performance of the subsetting IF statement and WHERE clause before deciding between the two. Flexibility and (usually) better performance make the WHERE clause the better choice for most applications in Release 6.07.

SORTEDBY SUPPORT

Release 6.07 stores a sort indicator with a SAS data file. The sort indicator expresses how the data are sorted. The SORT procedure automatically sets the SORTEDBY indicator when it finishes sorting a file. You can manually set the SORTEDBY indicator with the SORTEDBY= data set option.

The sort indicator enhances the performance of some applications by bypassing unnecessary sorts. Consider an application that reads a SAS data set that is sometimes sorted. This application begins with the SORT procedure to ensure the data are sorted correctly. In Releases 5.18 and 6.06, this application incurs the overhead of sorting all the time. In Release 6.07, the SORT procedure recognizes that the file is already sorted and bypasses any unnecessary sorts.

The value of the sort indicator is automatically synchronized with the data in a SAS data file. The SAS System turns off the sort indicator when you

   - add a new observation to the SAS data file
   - update an observation to change the value of one or more of the variables specified in the sort indicator (updates that do not affect the sort order do not turn off the sort indicator)
   - turn it off with the DATASETS procedure.

Release 6.07 uses the sort indicator in the following situations:

   - with the SORT procedure. PROC SORT does not sort a file that is already sorted.
   - with certain types of SQL joins. These joins are optimized when the data are sorted.
   - with the CONTENTS procedure. PROC CONTENTS reports the sort order.
   - with the BY statement. The BY statement uses the sort-indicator order instead of an index.
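The manual use of the sort indicator described in this section can be sketched as follows. The data set WORK.SALES and its variables are hypothetical, and the input records are deliberately entered in ascending REGION order so that the SORTEDBY= assertion is true.

```sas
/* Hypothetical data set; records are already in REGION order, so      */
/* the SORTEDBY= data set option records that fact in the file.        */
data work.sales(sortedby=region);
   input region $ amount;
   cards;
EAST 100
NORTH 250
WEST 75
;
run;

/* Per the behavior described above, PROC SORT can recognize the       */
/* sort indicator and bypass the sort.                                 */
proc sort data=work.sales;
   by region;
run;

/* PROC CONTENTS reports the sort order stored with the file.          */
proc contents data=work.sales;
run;
```

Setting SORTEDBY= by hand is only safe when the data really are in the stated order, since the option is an assertion about the file and not a request to sort it.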
CONCLUSIONS

Release 6.07 of the SAS System provides all the enhancements of Release 6.06 plus additional capabilities, while matching or bettering the performance of Release 5.18. With Release 6.07, SAS Institute demonstrates its commitment to be on the leading edge of software technology without sacrificing performance or efficiency.

REFERENCES

Beatrous, Steve and Armstrong, Karen (1991), "Effective Use of Indexes in the SAS System," Proceedings of the Sixteenth Annual SAS Users Group International Conference.

SAS and SAS/SHARE are registered trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.
