Cheat sheet: Data Processing Optimization - for Pharma Analysts & Statisticians

Karthik Chidambaram, Senior Program Director, Data Strategy, Genentech, CA

ABSTRACT

This paper provides tips and techniques that analysts and statisticians can use to optimize the data processing routines in their day-to-day work. A good deal of productivity is lost to slow SAS servers and slow response times from IT teams. However, there are tools and techniques that analysts can apply on their own to work around these inefficiencies. This paper lists those techniques and shares experience with the SAS GRID architecture.

Key sections of the paper:
1. Tips and techniques to optimize SAS programs to bypass bottlenecks
2. Hidden gems: quick tips to administer and optimize parameters to enhance processing of huge volumes of data
3. GRID: a quick primer on GRID (from an analyst/statistician perspective) and its advantages

TIPS AND TECHNIQUES TO OPTIMIZE SAS PROGRAMS TO BYPASS BOTTLENECKS

OPTIMIZING WINDOWS MACHINE FOR PROCESSING YOUR PROGRAMS:

In many cases, servers or machines underperform and the blame is placed mostly on the SAS system. However, there are instances where the back-end system could be tuned to serve the analytics better. For instance, under Windows 7, follow these steps to optimize application performance:
1. Open the Control Panel.
2. Click System and Security.
3. Select System.
4. Click the Advanced system settings task.
5. Select the Advanced tab.
6. In the Performance box, click Settings and then select the Advanced tab.
7. To optimize performance of an interactive SAS session, select Programs. To optimize performance of a batch SAS session, select Background services.
8. Click OK.

This setting controls how Windows schedules processor time between foreground programs and background services, so it should match the type of SAS processing we run. Matching it helps the stability and responsiveness of the server or PC.
The same optimization can be applied on other versions of Windows, although the navigation path may differ slightly.

USING HIGHLY RECURSIVE PROCESS WITH MODERATE SIZED DATASETS? CONSIDER MEMLIB OR MEMCACHE

With the MEMLIB and MEMCACHE options, we can create memory-based libraries. Using memory-based libraries reduces the I/O to and from disk. In particular, if our permanent library is on a SAN, we will see a substantial processing improvement with the MEMLIB option. Memory-based libraries can be used in several ways:
1. As storage for the work library
2. For processing SAS libraries with high I/O
3. As a cache for very large SAS libraries

CHECK THE ASSIGNMENT OF THE SAS WORK LIBRARY

Especially in server-based SAS processing, there is an ever-increasing need for additional work space on the server. When the number of users or the size of the data being processed increases, the required work space grows correspondingly, and in most cases this affects the performance of the system. SAS processes are I/O intensive and use the work library to store temporary files. There are two common issues with the SAS work library setup:
1. Size of the work folder
2. Network connectivity to the work folder from the server
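Both issues can be examined from inside a SAS session. The following is a minimal sketch, not a site-specific procedure: it writes the physical location of WORK to the log, turns on FULLSTIMER, and runs a small step so that the log shows real time next to CPU time. SASHELP.CLASS and the output name work.class_sorted are only stand-ins for our own data.

```sas
/* Write the physical path of the WORK library to the log */
%put NOTE: WORK is assigned to %sysfunc(pathname(work));

/* Show the current value of the WORK system option */
proc options option=work;
run;

/* Request detailed resource statistics for every step */
options fullstimer;

/* Any step that writes to WORK will do; SASHELP.CLASS is a stand-in */
proc sort data=sashelp.class out=work.class_sorted;
    by age;
run;
```

If the real time reported for a step is much larger than the sum of its user and system CPU time, the step spent most of its time waiting, which often points to I/O on the WORK location.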

Workaround: Check the SAS work library assignment using PROC DATASETS. Check for I/O issues by switching on the FULLSTIMER option. If you notice I/O issues, point WORK to a different location with the WORK system option, which is set at SAS invocation (on the command line or in the configuration file).

OPTIMIZE YOUR CODE

Many times, a simple change to the code can result in a large efficiency gain. A quick look at some efficient SAS coding options:

- If we will be reading a flat file multiple times, it is better to create a SAS data set from it once. Reading a SAS data set is much faster than reading a flat file.
- When using arrays in long programs, where the array elements are needed only during DATA step processing and are not intended for the result data set, declare the array with _TEMPORARY_. Temporary array elements are not written to the output data set, and their memory is released when the step completes.
- To reduce I/O, apply filters at the beginning of the code, especially when dealing with huge volumes of data. Even while filtering, a combination of WHERE and KEEP statements can yield additional performance gains.

The SAS program data vector (PDV) allocates buffer space based on the number of variables that are read in and the number of variables that are created during DATA step processing. Hence, if we are using 4 variables out of 10 from a data set, a KEEP= option on the SET statement is more efficient than a KEEP statement at the end of the step, because KEEP= on the SET statement avoids reading the unwanted columns into the buffer.

Less efficient code:

    DATA sample;
        SET source;
        /* other SAS statements */
        KEEP var1 var2 var3;
    RUN;

Efficient code:

    DATA sample;
        SET source (KEEP = var1 var2 var3);
        /* other SAS statements */
    RUN;

Both IF and WHERE statements can be used to subset a data set based on specified criteria. Though they produce the same results in most cases, they differ greatly in the way they operate on the data.
With the IF statement, each observation is read into the program data vector before the condition is checked. Thus all records are read into the PDV regardless of whether they meet the criteria. The WHERE statement, by contrast, checks the criteria before the data is read into the PDV, so the unwanted records are never read into the buffer space at all. The WHERE statement is therefore the better option for subsetting, especially for data sets with a large number of variables.

Less efficient code:

    DATA subst;
        SET source;
        IF sales > 1000;
    RUN;

Efficient code:

    DATA subst;
        SET source;
        WHERE sales > 1000;
    RUN;
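The KEEP= and WHERE techniques above can be combined in a single step, together with a temporary array. The sketch below assumes SASHELP.SHOES as a stand-in for a large source data set (it contains a Sales variable); the output name work.big_sales, the cutoff values, and the sales_band variable are hypothetical illustrations.

```sas
options fullstimer;   /* log resource usage so variants can be compared */

data work.big_sales;
    /* Read only the needed columns, and only the qualifying rows.   */
    /* Sales is kept because the WHERE= expression refers to it.     */
    set sashelp.shoes (keep=Region Product Sales
                       where=(Sales > 1000));

    /* Temporary array: its elements are not written to the output */
    array band_limit{3} _temporary_ (10000 50000 100000);

    /* Assign a size band from 1 to 4 using the temporary cutoffs */
    sales_band = 1;
    do i = 1 to dim(band_limit);
        if Sales > band_limit{i} then sales_band = i + 1;
    end;
    drop i;
run;
```

Running the same step once with the data set options and once with a subsetting IF and a trailing KEEP statement, with FULLSTIMER on, gives a rough comparison of the two approaches on our own data.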

HIDDEN GEMS: QUICK TIPS TO ADMINISTER & OPTIMIZE PARAMETERS TO ENHANCE PROCESSING HUGE VOLUMES OF DATA

Many SAS users do not adjust the SAS system options and work with the default settings on their system. There are several hundred such options, and it is virtually impossible to master the right setting for each of them. This section highlights a few interesting options that may offer a large performance benefit.

BUFNO=, BUFSIZE=, CATCACHE=, AND COMPRESS= SYSTEM OPTIONS

BUFNO: SAS uses the BUFNO= option to adjust the number of open page buffers when it processes a SAS data set. Increasing this option's value can improve our application's performance by allowing SAS to read more data with fewer passes; however, memory usage increases. Experiment with different values for this option to determine the optimal value for our needs. Note: We can also use the CBUFNO= system option to control the number of extra page buffers to allocate for each open SAS catalog.

BUFSIZE: When the Base SAS engine creates a data set, it uses the BUFSIZE= option to set the permanent page size for the data set. The page size is the amount of data that can be transferred to one buffer in a single I/O operation. The default value for BUFSIZE= is determined by the operating environment and is set to optimize sequential access. To improve performance for direct (random) access, we should change the value of BUFSIZE=.

Whether we use our operating environment's default value or specify a value, the engine always writes complete pages regardless of how full or empty those pages are. If we know that the total amount of data is going to be small, we can set a small page size with the BUFSIZE= option, so that the total data set size remains small and we minimize the amount of wasted space on a page. In contrast, if we know that we are going to have many observations in a data set, we should choose BUFSIZE= so that as little overhead as possible is needed. Note that each page requires some additional overhead. Large data sets that are accessed sequentially benefit from larger page sizes because sequential access reduces the number of system calls that are required to read the data set. Note that because observations cannot span pages, there is typically some unused space on a page.

CATCACHE: SAS uses this option to determine the number of SAS catalogs to keep open at one time. Increasing its value can use more memory, although this might be warranted if our application uses catalogs that will be needed relatively soon by other applications. (The catalogs closed by the first application are cached and can be accessed more efficiently by subsequent applications.)

COMPRESS: One further technique that can reduce I/O processing is to store our data as compressed data sets by using the COMPRESS= data set option. However, storing our data this way means that more CPU time is needed to decompress the observations as they are made available to SAS. If our concern is I/O, and not CPU usage, compressing our data might improve the I/O performance of our application.

SASFILE STATEMENT

The SASFILE global statement opens a SAS data set and allocates enough buffers to hold the entire data set in memory. Once it is read, the data is held in memory, available to subsequent DATA and PROC steps, until either a second SASFILE statement closes the file and frees the buffers or the program ends, which automatically closes the file and frees the buffers.

Using the SASFILE statement can improve performance by:
- Reducing multiple open/close operations (including allocation and freeing of memory for buffers) to process a SAS data set to one open/close operation
- Reducing I/O processing by holding the data in memory
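As an illustration of this pattern, the sketch below loads a data set into memory once, reads it in two steps, and then releases it. The data set work.lookup is hypothetical and is built from SASHELP.SHOES only so that the example is self-contained; whether the technique helps in practice depends on the file fitting in real memory.

```sas
/* Build a small stand-in data set */
data work.lookup;
    set sashelp.shoes;
run;

/* Open the file and read it into memory buffers */
sasfile work.lookup load;

/* Both steps below read the data from memory */
proc means data=work.lookup mean sum;
    var Sales;
run;

proc freq data=work.lookup;
    tables Region;
run;

/* Close the file and free the buffers */
sasfile work.lookup close;
```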
If our SAS program consists of steps that read a SAS data set multiple times, and we have enough memory to hold the entire file in real memory, the program should benefit from using the SASFILE statement. SASFILE is also especially useful as part of a program that starts a SAS server such as a SAS/SHARE server.

IBUFSIZE SYSTEM OPTION

An index is an optional SAS file that we can create for a SAS data file in order to provide direct access to specific observations. The index file consists of entries that are organized into hierarchical levels, such as a tree structure, and connected by pointers. When an index is used to process a request, such as for WHERE processing, SAS searches the index file in order to rapidly locate the requested records.

Typically, we do not need to specify an index page size. However, the following situations could require a different page size:
- The page size affects the number of levels in the index. The more pages there are, the more levels in the index, and the more levels, the longer the index search takes. Increasing the page size allows more index values to be stored on each page, thus reducing the number of pages (and the number of levels). The number of pages required for the index varies with the page size, the length of the index value, and the values themselves. The main resource that is saved when reducing levels in the index is I/O. If our application is experiencing a lot of I/O in the index file, increasing the page size might help. However, we must re-create the index file after increasing the page size.
- The index file structure requires a minimum of three index values to be stored on a page. If the length of an index value is very large, we might get an error message that the index could not be created because the page size is too small to hold three index values. Increasing the page size should eliminate the error.

REUSE SYSTEM OPTION

REUSE= specifies whether new observations can be written to freed space in compressed SAS data sets. If space is reused, observations that are added to the SAS data set are inserted wherever enough free space exists, instead of at the end of the SAS data set. Specifying REUSE=NO results in less efficient usage of space if we delete or update many observations in a SAS data set. However, the APPEND procedure, the FSEDIT procedure, and other procedures that add observations to the SAS data set continue to add observations to the end of the data set, as they do for uncompressed SAS data sets. We cannot change the REUSE= attribute of a compressed SAS data set after it is created.
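Both the index page size and the REUSE= attribute are fixed when the corresponding file is created, so they have to be in place before the data set and its index are written. A minimal sketch, assuming a stand-in data set built from SASHELP.SHOES; the name work.shoes_c and the chosen values are illustrative, not recommendations.

```sas
/* Index page size (in bytes) for indexes created from here on */
options ibufsize=32767;

/* COMPRESS= and REUSE= are set when the data set is created */
data work.shoes_c (compress=yes reuse=yes);
    set sashelp.shoes;
run;

/* Create (or re-create) the index so that it uses the new page size */
proc datasets library=work nolist;
    modify shoes_c;
    index create Region;
quit;

/* PROC CONTENTS reports the Compressed and Reuse Space attributes */
proc contents data=work.shoes_c;
run;
```

For a very small data set, SAS may decide that compression is not worthwhile and say so in the log; the pattern is the same for larger files.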
Space is tracked and reused in the compressed SAS data set according to the REUSE= value that was specified when the SAS data set was created, not when we add and delete observations. Even with REUSE=YES, the APPEND procedure will add observations at the end. It may be worthwhile to check the default setting for this option and set it to YES, especially in environments dealing with a lot of data updates.

SAS GRID: A QUICK PRIMER ON GRID (FROM AN ANALYST/STATISTICIAN PERSPECTIVE) AND ITS ADVANTAGES

SAS Grid Manager delivers grid computing capabilities, enabling organizations to create a managed, shared environment for processing large volumes of data and analytic programs. The grid effectively combines several servers with dynamic load balancing. From an analyst's point of view, and leaving aside the IT terminology, Grid Manager replaces a single server shared by a pool of users with a pool of CPUs, balancing the load across several machines to provide better performance and reliability. Some key benefits include:
- Automatically tailors SAS Data Integration Studio and SAS Enterprise Miner for parallel processing and job submission in a grid environment.
- Balances the load of many SAS Enterprise Guide users through easy submission to the grid.
- Provides load balancing for all SAS servers to improve throughput and response time of all SAS clients.
- Uses SAS Code Analyzer to analyze job dependencies in SAS programs and generates grid-ready code. It is used by SAS Data Integration Studio and SAS Enterprise Guide to import SAS programs.
- Provides automated session spawning and distributed processing of SAS programs across a set of diverse computing resources.
- Speeds up processing of applicable SAS programs and applications, and provides more efficient computing resource utilization.
- Enables scheduling of production SAS workflows to be executed across grid resources:
  - Provides a process flow diagram to create SAS flows of one or more SAS jobs that can be simple or complex to meet our needs.
  - Uses all of the policies and resources of the grid.
- Enables many SAS solutions and user-written programs to be easily configured for submission to a grid of shared resources.
- Integrates with all SAS Business Intelligence clients and analytic applications by storing grid-enabled code as SAS Stored Processes.
- Provides greater resilience for mission-critical applications and high availability for the SAS environment.
- Includes a command-line batch submission utility called SASGSUB:
  - Allows us to submit and forget, and reconnect later to retrieve results.
  - Enables integration with other standard enterprise schedulers.
- Enables batch submission to leverage checkpoint and automatically restart jobs.
- Applies grid policies to SAS workspace servers when they are launched through the grid.

CONCLUSION

This paper has highlighted basic, easy rules for optimizing SAS processing. With minimal changes to our code, we can make sure that our programs run effectively and efficiently, leveraging the features of the SAS system.

ACKNOWLEDGMENTS

The author would like to thank his family, friends, peers and supervisors for their encouragement, support and suggestions.

CONTACT INFORMATION

Karthikeyan Chidambaram, a SAS certified professional, has over 15 years of experience in SAS in a variety of roles including SAS administration, statistical analysis and ETL programming.

Your comments and questions are valued and encouraged. Contact the author at:

Karthikeyan Chidambaram
Genentech Inc.
1 DNA Way
South San Francisco, CA 94080
Phone: 805-300-0505
Email: karthihere@hotmail.com, Chidambaram.karthikeyan@gene.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.