Cheat Sheet: Data Processing Optimization for Pharma Analysts & Statisticians
Karthik Chidambaram, Senior Program Director, Data Strategy, Genentech, CA

ABSTRACT
This paper provides tips and techniques that analysts and statisticians can use to optimize the data processing routines in their day-to-day work. A great deal of productivity is lost to slow SAS servers and slow response times from IT teams. However, there are tools and techniques that analysts can apply on their own end to bypass these inefficiencies. This paper presents a list of those techniques and shares experience with the SAS GRID architecture. Key sections of the paper:
1. Tips and techniques to optimize SAS programs to bypass bottlenecks
2. Hidden gems: quick tips to administer and optimize parameters to enhance processing of huge volumes of data
3. GRID: a quick primer on GRID (from an analyst/statistician perspective) and its advantages

TIPS AND TECHNIQUES TO OPTIMIZE SAS PROGRAMS TO BYPASS BOTTLENECKS
OPTIMIZING THE WINDOWS MACHINE FOR PROCESSING YOUR PROGRAMS
In many cases, servers or machines underperform and the blame is placed mostly on the SAS system. However, there are instances where the back-end system can be tuned to better serve the analytics. For instance, under Windows 7, follow these steps to optimize application performance:
1. Open the Control Panel
2. Click System and Security
3. Select System
4. Click the Advanced system settings task
5. Select the Advanced tab
6. In the Performance box, click Settings and then select the Advanced tab
7. To optimize performance of an interactive SAS session, select Programs; to optimize performance of a batch SAS session, select Background services
8. Click OK
This optimization ensures that memory and page files are tuned appropriately for the type of SAS processing we use, and it helps considerably with the stability and memory handling of the server or PC.
Irrespective of the type of Windows machine used, the optimization listed above can be applied, even though the navigation path may differ slightly.

USING A HIGHLY RECURSIVE PROCESS WITH MODERATE-SIZED DATASETS? CONSIDER MEMLIB OR MEMCACHE
With the MEMLIB and MEMCACHE options, we can create memory-based libraries. Using memory-based libraries reduces the I/O to and from disk. In particular, if our permanent library is on a SAN, we will see a substantial processing improvement with the MEMLIB option. Memory-based libraries can be used in several ways:
1. As storage for the WORK library
2. For processing SAS libraries with high I/O
3. As a cache for very large SAS libraries

CHECK THE ASSIGNMENT OF THE SAS WORK LIBRARY
Especially in server-based SAS processing, there is an ever-increasing need for additional space on the work server. When the number of users or the size of the processing database grows, the size of the workspace must grow correspondingly. In most cases, this impacts the performance of the system. SAS processes are I/O intensive and use the WORK library to store temporary files. There are two common issues with the SAS WORK library setup:
1. The size of the work folder
2. Network connectivity to the work folder from the server
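The memory-based library technique described above can be sketched as follows. This is a minimal illustration for a Windows server with spare RAM; the library name and path are hypothetical, and available memory should be verified before enabling these options:

```sas
/* Option 1: place the WORK library itself in memory for this session.
   (This is typically set at SAS invocation, e.g. sas -memlib.)        */
options memlib;

/* Option 2: assign a specific memory-based library with the MEMLIB
   libname option; "fastlib" and the path are illustrative only.       */
libname fastlib "C:\temp\fastlib" memlib;

/* High-I/O intermediate data now lives in memory instead of on disk. */
data fastlib.lookup;
  set sashelp.class;
run;
```

Because memory-based libraries compete with SAS processing for RAM, this is best reserved for highly recursive jobs on moderate-sized datasets, as the section title suggests.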
Workaround: check the SAS WORK library assignment using PROC DATASETS. Check for I/O issues by switching on the FULLSTIMER option. If you notice I/O issues, try defining a different location using the WORK system option at invocation or by modifying the WORK assignment in autoexec.sas.

OPTIMIZE YOUR CODE
Many times, a simple change to the code can result in a huge efficiency gain. A quick look at some efficient SAS coding options:
1. If we will be reading a flat file multiple times, it is better to create a SAS dataset first; reading a SAS dataset is much faster than reading a flat file.
2. When using arrays in long programs, where the content generated in the DATA step is not intended for output to the result dataset, be sure to declare them _TEMPORARY_. This releases the memory after processing is complete.
3. To reduce I/O, ensure that filters are applied at the beginning of the code, especially when dealing with huge volumes of data. Even while filtering, a combination of WHERE statements and KEEP statements can yield additional performance gains.
The SAS program data vector (PDV) allocates buffer space based on the number of variables being read in and the number of variables created during DATA step processing. Hence, if we are using 4 variables out of 10 from a dataset, a KEEP= option on the SET statement is more efficient than a KEEP statement at the end of the program. This is because the KEEP= option, when used on the SET statement, avoids reading the unwanted columns into the buffer.

Less efficient code:
    data sample;
      set source;
      /* other SAS statements */
      keep var1 var2 var3;
    run;

Efficient code:
    data sample;
      set source (keep = var1 var2 var3);
      /* other SAS statements */
    run;

Both IF and WHERE statements can be used to subset a dataset based on specified criteria. Though both produce exactly the same results in most cases, they differ greatly in how they operate on the data.
With the IF statement, the data is read into the program data vector before the condition is checked, so all records are read into the PDV regardless of whether they meet the criteria. By contrast, the WHERE statement checks the criteria before the data is read into the PDV, so unwanted records are never read into the buffer space at all. The WHERE statement is therefore the better option for subsetting data, especially for datasets with a large number of variables.

Less efficient code:
    data subst;
      set source;
      if sales > 1000;
    run;

Efficient code:
    data subst;
      set source;
      where sales > 1000;
    run;
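The coding tips in this section can be combined into one short sketch. The dataset and variable names are illustrative, and the WORK path reported will differ by installation:

```sas
/* Turn on detailed resource statistics (CPU, I/O, memory) in the log,
   which helps diagnose the work-library I/O issues discussed above.   */
options fullstimer;

/* Confirm where the WORK library actually points. */
%put WORK is assigned to: %sysfunc(pathname(work));

/* Subset as early as possible: restrict both columns (KEEP=) and rows
   (WHERE=) on the SET statement, before data reaches the PDV.         */
data subset;
  set source (keep=var1 var2 sales
              where=(sales > 1000));
run;
```

Comparing the log's FULLSTIMER output before and after such changes is a simple way to confirm whether an optimization actually reduced I/O on a given system.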
HIDDEN GEMS: QUICK TIPS TO ADMINISTER & OPTIMIZE PARAMETERS TO ENHANCE PROCESSING OF HUGE VOLUMES OF DATA
Many SAS users never adjust the SAS system options and work with the default settings. There are several hundred such options, and it is virtually impossible to master the right setting for each of them. This section highlights a few interesting parameters that may offer huge performance benefits.

BUFNO=, BUFSIZE=, CATCACHE=, AND COMPRESS= SYSTEM OPTIONS
BUFNO: SAS uses the BUFNO= option to adjust the number of open page buffers when it processes a SAS data set. Increasing this option's value can improve performance by allowing SAS to read more data with fewer passes; however, memory usage increases. Experiment with different values for this option to determine the optimal value for our needs. Note: we can also use the CBUFNO= system option to control the number of extra page buffers to allocate for each open SAS catalog.

BUFSIZE: When the Base SAS engine creates a data set, it uses the BUFSIZE= option to set the permanent page size for the data set. The page size is the amount of data that can be transferred in one I/O operation to one buffer. The default value for BUFSIZE= is determined by the operating environment and is set to optimize the sequential access method. To improve performance for direct (random) access, we should change the value of BUFSIZE=. Whether we use the operating environment's default value or specify our own, the engine always writes complete pages, regardless of how full or empty those pages are. If we know that the total amount of data will be small, we can set a small page size with the BUFSIZE= option, so that the total data set size remains small and we minimize the amount of wasted space on a page.
In contrast, if we know that a data set will have many observations, we should optimize BUFSIZE= so that as little overhead as possible is needed; note that each page requires some additional overhead. Large data sets that are accessed sequentially benefit from larger page sizes, because sequential access reduces the number of system calls required to read the data set. Note that because observations cannot span pages, there is typically some unused space on a page.

CATCACHE: SAS uses this option to determine the number of SAS catalogs to keep open at one time. Increasing its value uses more memory, although this may be warranted if our application uses catalogs that will soon be needed by other applications. (The catalogs closed by the first application are cached and can be accessed more efficiently by subsequent applications.)

COMPRESS: One further technique that can reduce I/O processing is to store our data as compressed data sets by using the COMPRESS= data set option. Storing our data this way means that more CPU time is needed to decompress the observations as they are made available to SAS. But if our concern is I/O and not CPU usage, compressing our data might improve the I/O performance of our application.

SASFILE STATEMENT
The SASFILE global statement opens a SAS data set and allocates enough buffers to hold the entire data set in memory. Once read, the data is held in memory, available to subsequent DATA and PROC steps, until either a second SASFILE statement closes the file and frees the buffers or the program ends, which automatically closes the file and frees the buffers. Using the SASFILE statement can improve performance by:
1. Reducing multiple open/close operations (including allocation and freeing of buffer memory) for a SAS data set to one open/close operation
2. Reducing I/O processing by holding the data in memory
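As a sketch of how these options and the SASFILE statement might be applied together: the values below are illustrative only (optimal settings must be found by experiment, as noted above), and the library and data set names are hypothetical.

```sas
/* Illustrative settings only: tune by experiment with FULLSTIMER on. */
options bufno=10          /* more page buffers per open data set      */
        bufsize=64k       /* page size for newly created data sets    */
        catcache=4        /* keep up to 4 catalogs open at one time   */
        compress=yes;     /* trade CPU time for reduced I/O           */

/* Hold a repeatedly read data set in memory across multiple steps. */
sasfile mylib.big_table load;

proc means data=mylib.big_table;
run;

proc freq data=mylib.big_table;
run;

sasfile mylib.big_table close;   /* free the buffers */
```

Without the SASFILE statement, each PROC step above would open, read and close the data set from disk separately; with it, the file is read once and both steps run against the in-memory copy.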
If our SAS program consists of steps that read a SAS data set multiple times, and we have an adequate amount of memory so that the entire file can be held in real memory, the program should benefit from the SASFILE statement. SASFILE is also especially useful in a program that starts a SAS server, such as a SAS/SHARE server.

IBUFSIZE SYSTEM OPTION
An index is an optional SAS file that we can create for a SAS data file in order to provide direct access to specific observations. The index file consists of entries that are organized into hierarchical levels, such as a tree structure,
and connected by pointers. When an index is used to process a request, such as WHERE processing, SAS searches the index file to rapidly locate the requested records. Typically, we do not need to specify an index page size. However, the following situations could require a different page size:
1. The page size affects the number of levels in the index. The more pages there are, the more levels in the index, and the more levels, the longer the index search takes. Increasing the page size allows more index values to be stored on each page, thus reducing the number of pages (and the number of levels). The number of pages required for the index varies with the page size, the length of the index values, and the values themselves. The main resource saved by reducing index levels is I/O; if our application is experiencing a lot of I/O in the index file, increasing the page size might help. Note that we must re-create the index file after increasing the page size.
2. The index file structure requires a minimum of three index values to be stored on a page. If the length of an index value is very large, we might get an error message that the index could not be created because the page size is too small to hold three index values. Increasing the page size should eliminate the error.

REUSE SYSTEM OPTION
If space is reused, observations added to a SAS data set are inserted wherever enough free space exists, instead of at the end of the data set. Specifying REUSE=NO results in less efficient use of space if we delete or update many observations in a SAS data set. However, the APPEND procedure, the FSEDIT procedure, and other procedures that add observations continue to add them at the end of the data set, as they do for uncompressed SAS data sets. We cannot change the REUSE= attribute of a compressed SAS data set after it is created.
Space is tracked and reused in the compressed SAS data set according to the REUSE= value specified when the data set was created, not when observations are added or deleted. Even with REUSE=YES, the APPEND procedure adds observations at the end. It may be worthwhile to check the default setting of this option and set it to YES, especially in environments with frequent data updates.

SAS GRID: A QUICK PRIMER ON GRID (FROM AN ANALYST/STATISTICIAN PERSPECTIVE) AND ITS ADVANTAGES
SAS Grid Manager delivers grid computing capabilities, enabling organizations to create a managed, shared environment for processing large volumes of data and analytic programs. The grid effectively combines several servers, with dynamic load-balancing capabilities. From an analyst's perspective, without the IT jargon: rather than relying on a single server for a shared pool of users, the GRID manager combines a pool of CPUs and balances the load across several machines, providing better performance and enhanced reliability. Some key benefits include:
- Automatically tailors SAS Data Integration Studio and SAS Enterprise Miner for parallel processing and job submission in a grid environment.
- Balances the load of many SAS Enterprise Guide users through easy submission to the grid.
- Provides load balancing for all SAS servers to improve throughput and response time for all SAS clients.
- Uses the SAS Code Analyzer to analyze job dependencies in SAS programs and generate grid-ready code; used by SAS Data Integration Studio and SAS Enterprise Guide to import SAS programs.
- Provides automated session spawning and distributed processing of SAS programs across a set of diverse computing resources.
- Speeds up processing of applicable SAS programs and applications, and provides more efficient use of computing resources.
- Enables scheduling of production SAS workflows to be executed across grid resources:
  - Provides a process flow diagram to create SAS flows of one or more SAS jobs, simple or complex, to meet our needs.
  - Uses all of the policies and resources of the grid.
- Enables many SAS solutions and user-written programs to be easily configured for submission to a grid of shared resources.
- Integrates with all SAS Business Intelligence clients and analytic applications by storing grid-enabled code as SAS Stored Processes.
- Provides greater resilience for mission-critical applications and high availability for the SAS environment.
- Includes a command-line batch submission utility called SASGSUB:
  - Allows us to submit and forget, and reconnect later to retrieve results.
  - Enables integration with other standard enterprise schedulers.
  - Enables batch submissions to leverage checkpoints and automatically restart jobs.
  - Applies grid policies to SAS workspace servers when they are launched through the grid.

CONCLUSION
This paper has highlighted basic, easy-to-apply rules for optimizing SAS processing. With some minimal changes to our code, we can make sure that our programs run effectively and efficiently, leveraging the many useful features of the SAS system.

REFERENCES
SAS Online Help, www.sas.com

ACKNOWLEDGMENTS
The author would like to thank his family, friends, peers and supervisors for their encouragement, support and suggestions.

CONTACT INFORMATION
Karthikeyan Chidambaram, a SAS certified professional, has over 15 years of experience with SAS in a variety of roles, including SAS administration, statistical analysis and ETL programming. Your comments and questions are valued and encouraged. Contact the author at:
Karthikeyan Chidambaram
Genentech Inc.
1 DNA Way
South San Francisco, CA 94080
Phone: 805-300-0505
Email: karthihere@hotmail.com, Chidambaram.karthikeyan@gene.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.