Paper 76-28

Comparative Efficiency of SQL and Base Code When Reading from Database Tables and Existing Data Sets

Steven Feder, Federal Reserve Board, Washington, D.C.

ABSTRACT

In this paper we compare SQL code to BASE code, looking at both DATA step and PROC code. We make these comparisons reading data from both DB2 tables and existing SAS data sets.

INTRODUCTION

Running SAS Version 8.1 on MVS, we start by reading data sets and build up to summarizing data. Overall we find that SQL code performs more efficiently than BASE code when reading DB2 tables, but PROCs can be superior for summarizations when reading existing data sets. We find the same results when running SAS on NT, reading the same DB2 tables and data sets.

To exclude the effect of other programs running on the system, the comparisons are limited to CPU time. Elapsed times are noted as a precaution against major differences in non-CPU time affecting the direction of the results.

READING DATA FROM DB2

Starting with the basics and reading 1,000,000 observations from a DB2 table, there was no difference between using SQL and DATA step code.

    options obs=1000000;
    libname in db2 ssid=dsn authid=cra;

    /* DATA step */
    data sasout.crt_msa_area;
      set in.crt_msa_area;
    run;

    /* SQL */
    proc sql;
      create table sasout.crt_msa_area as
      select * from in.crt_msa_area;
    quit;

Time (Minutes:Seconds)

                 CPU        Elapsed
    SQL          01:23.0    01:37.0
    DATA STEP    01:23.4    01:33.6

Similarly, using a WHERE clause, there was no difference when subsetting with an IN operator for 6 values, with 375,408 resulting observations, or when subsetting for one value, with 110 resulting observations, all read again from the DB2 table. For a simple read of a table, SQL and DATA step code operate with similar efficiency.
    libname in db2 ssid=dsn authid=cra;

    /* Subsetting for 6 values: DATA step */
    data sasout.crt_msa_area_2;
      set in.crt_msa_area;
      where state_code in(26,27,29,36,37,39);
    run;

    /* Subsetting for 6 values: SQL */
    proc sql;
      create table sasout.crt_msa_area_2 as
      select * from in.crt_msa_area
      where state_code in(26,27,29,36,37,39);
    quit;

    /* Subsetting for 1 value: DATA step */
    data sasout.crt_msa_area_2;
      set in.crt_msa_area;
      where county_code=28;
    run;

    /* Subsetting for 1 value: SQL */
    proc sql;
      create table sasout.crt_msa_area_2 as
      select * from in.crt_msa_area
      where county_code=28;
    quit;

Time (Minutes:Seconds)

                     CPU        Elapsed
    Where 6 Values
      SQL            00:39.7    00:47.9
      DATA STEP      00:39.6    00:52.1
    Where 1 Value
      SQL            00:04.3    00:07.1
      DATA STEP      00:04.3    00:07.0

READING DATA FROM DATA SETS

Reading from an existing SAS data set, DATA step code showed a slight advantage in all 3 of the cases tested above: when reading 1,000,000 observations without subsetting, and with each of the 2 subsetting clauses. From this, we conclude that for a simple read of an existing data set, DATA step code is slightly preferable. All of the code was the same as above except that the LIBREF with the DB2 engine was deleted and a DD statement pointing to an existing SAS data set was used instead, with the following results:

                     CPU        Elapsed
    No Subsetting
      SQL            00:06.8    00:22.6
      DATA STEP      00:05.3    00:23.9
    Where 6 Values
      SQL            00:04.5    00:15.7
      DATA STEP      00:04.1    00:16.9
    Where 1 Value
      SQL            00:01.8    00:13.6
      DATA STEP      00:01.6    00:09.1

UNIQUE COMBINATIONS OF VARIABLES

After first establishing that there was no difference between SQL and DATA step code in reading a second table, subset to 354,318 observations, simple summary reporting of all 8,280 unique combinations of two variables was tested. For SQL, this was a SELECT DISTINCT of the two variables.

    proc sql;
      create table temp2 as
      select distinct fcex8577, fcex8578
      from in.cuv_fcex01;
    quit;

For BASE code, a number of variations were tested. A DATA step with a BY statement selected the unique combinations, preceded by a PROC SORT, which was in turn preceded by a DATA step initially reading the table from DB2. (See table entry for METHOD 1.)

    data temp;
      set in.cuv_fcex01(keep=fcex8577 fcex8578);
    run;

    proc sort data=temp;
      by fcex8577 fcex8578;
    run;

    data temp2;
      set temp;
      by fcex8577 fcex8578;
      if first.fcex8578;   /* keep one row per unique combination */
    run;

The SQL was dramatically more efficient, requiring less than a third of the time. Running the PROC SORT directly against the DB2 table and skipping the first DATA step did not help. (See table entry for METHOD 2.)

    proc sort data=in.cuv_fcex01 (where=(fcex8577 <=50) keep=fcex8577 fcex8578)
              out=temp;
      by fcex8577 fcex8578;
    run;

                     CPU        Elapsed
    METHOD 1
      DATA STEP      00:30.0    00:33.5
      SORT           00:02.7    00:05.3
      DATA STEP      00:01.5    00:02.1
      TOTAL          00:34.2    00:40.9
    METHOD 2
      SORT           00:33.6    00:36.5
      DATA STEP      00:01.5    00:02.2
      TOTAL          00:35.1    00:38.8

But, surprisingly for those of us less active on the database side, when reading from a DB2 table with a BY statement, no prior PROC SORT is required. However, this single DATA step, comparable to the single SQL step in coding complexity and marginally superior to the other DATA step methods, was still outperformed by the SQL code above by a factor of four.

    libname in db2 ssid=dsn authid=fdrp;

    data temp2;
      set in.cuv_fcex01(keep=fcex8577 fcex8578);
      by fcex8577 fcex8578;
      if first.fcex8578;
    run;

                     CPU        Elapsed
    DATA STEP        00:34.0    00:37.3

Since the tabulated values (not the frequency values) of a PROC FREQ on the DB2 table produce the same summary of unique combinations, this method was also tested.
Interestingly, though the PROC FREQ was marginally superior to the DATA step, it still lost to the SQL by a factor of three. Where a simple read with a SELECT DISTINCT is sufficient, SQL should be the code of choice.

    proc freq data=in.cuv_fcex01 (keep=fcex8577 fcex8578) noprint;
      tables fcex8577*fcex8578 / out=temp2 (keep=fcex8577 fcex8578);
    run;

                     CPU        Elapsed
    FREQ             00:30.6    00:32.4

As an aside, it should be noted that when sorting directly from DB2, if the subsetting is done with a WHERE statement rather than as a data set option on the input table, the penalty is huge, multiplying processing time by a factor of three.

When selecting unique combinations from an existing data set, the SQL code and BASE (SORT, then DATA step) code performed similarly, but the PROC FREQ took significantly less time. This points in the direction of favoring SQL when working with DB2 data, but identifying the right PROC for existing data sets.

                     CPU        Elapsed
    SQL              00:05.3    00:08.1
    BASE CODE
      SORT           00:03.6    00:06.4
      DATA STEP      00:01.5    00:02.2
      TOTAL          00:05.1    00:08.6
    FREQ             00:03.3    00:04.4

SUMMARIZING A VARIABLE

For summing a variable over unique combinations of two other variables, SQL code again outperformed BASE code by a factor of 3 when reading from DB2. A PROC MEANS marginally outperformed DATA step code. But when reading from an existing data set, SQL code was only slightly more efficient than a PROC SORT and DATA step, and both were outperformed by the PROC MEANS by a factor of 2. So SQL code should be the code of choice when reading from DB2 and performing a summing operation at the same time, but a PROC MEANS is the gold standard for summing from an existing data set.
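The existing-data-set side of this comparison can be tried on any system with a sketch along the following lines. It is not the author's code: the data set work.sample and the variables grp1, grp2, and amount are hypothetical stand-ins for the DB2 table and its columns, the data are randomly generated, and the RAND function assumes a SAS release newer than the Version 8.1 used in the paper. The FULLSTIMER system option writes real and CPU time for each step to the log, which is the kind of figure reported in the tables here.

```sas
/* Hypothetical test data: grp1, grp2, and amount are illustrative names. */
data work.sample;
  call streaminit(12345);
  do i = 1 to 1000000;
    grp1   = ceil(rand('uniform') * 10);
    grp2   = ceil(rand('uniform') * 50);
    amount = round(rand('uniform') * 100, 0.01);
    output;
  end;
  drop i;
run;

options fullstimer;   /* report real and CPU time per step in the log */

/* Method A: PROC SQL with GROUP BY */
proc sql;
  create table work.sum_sql as
  select grp1, grp2, sum(amount) as total
  from work.sample
  group by grp1, grp2;
quit;

/* Method B: PROC SORT followed by a DATA step with BY-group processing */
proc sort data=work.sample out=work.sorted;
  by grp1 grp2;
run;

data work.sum_data(keep=grp1 grp2 total);
  set work.sorted;
  by grp1 grp2;
  if first.grp2 then total = 0;
  total + amount;     /* sum statement: total is retained across rows */
  if last.grp2 then output;
run;

/* Method C: PROC MEANS with CLASS variables; NWAY keeps only the full crossing */
proc means data=work.sample noprint nway;
  class grp1 grp2;
  var amount;
  output out=work.sum_means(drop=_type_ _freq_) sum=total;
run;

/* Check that two of the methods agree before comparing their timings */
proc compare base=work.sum_data compare=work.sum_means criterion=1e-9;
run;
```

Timings from a sketch like this depend on the platform and on the data, so they indicate what to test rather than reproduce the figures in this paper.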
    /* SQL */
    proc sql;
      create table temp2 as
      select distinct fcex8577, fcex8578, sum(fcex8600) as sum8600
      from in.cuv_fcex01
      where fcex8577 <=10
      group by fcex8577, fcex8578;
    quit;

    /* DATA STEP */
    proc sort data=in.cuv_fcex01;   /* needed only for an existing data set */
      by fcex8577 fcex8578;
    run;

    data temp;
      set in.cuv_fcex01(keep=fcex8577 fcex8578 fcex8600);
      where fcex8577 <=10;
      by fcex8577 fcex8578;
      retain sum8600;
      if first.fcex8578 then sum8600=0;
      sum8600=sum8600+fcex8600;
      if last.fcex8578 then output temp;
      drop fcex8600;                /* keep the sum, drop the detail value */
    run;

    /* PROC MEANS */
    proc means data=in.cuv_fcex01 noprint;
      class fcex8577 fcex8578;
      var fcex8600;
      output out=temp2(where=(_type_=3)) sum=sum8600;
      where fcex8577 <=10;
    run;

Summarize a Variable

                     CPU        Elapsed
    FROM DB2
      SQL            00:06.9    00:08.9
      DATA STEP      00:29.6    00:35.8
      MEANS          00:27.0    00:31.6
    FROM DATASET
      SQL            00:04.5    00:08.8
      BASE CODE
        SORT         00:03.5    00:07.0
        DATA STEP    00:01.8    00:02.8
        TOTAL        00:05.3    00:09.8
      MEANS          00:02.5    00:03.5

RESULTS ON NT

The above series of tests was also run on a Windows NT installation of SAS Version 8.1, reading the same DB2 tables. The only change was in the LIBNAME statement, now using the SAS/ACCESS to ODBC engine to access a driver for the DB2 server:

    libname in odbc complete="dsn=m1db2p; Uid=xxxxx;pwd=xxxxx" schema=fdrp;

While the magnitude of the differences varied, in all cases the comparisons between coding techniques paralleled those on MVS.

Reading from DB2

                     CPU        Elapsed
    No Subsetting
      SQL            00:35.4    01:04.8
      DATA STEP      00:33.6    01:06.9
    Where 6 Values
      SQL            00:13.3    00:32.7
      DATA STEP      00:12.7    00:34.1
    Where 1 Value
      SQL            00:00.0    00:07.0
      DATA STEP      00:00.0    00:07.1

Reading from Existing Data Set

                     CPU        Elapsed
    No Subsetting
      SQL            00:03.3    00:17.9
      DATA STEP      00:02.5    00:17.4
    Where 6 Values
      SQL            00:02.5    00:13.4
      DATA STEP      00:02.5    00:13.5
    Where 1 Value
      SQL            00:01.5    00:10.8
      DATA STEP      00:01.6    00:10.8

Unique Combinations of Variables from DB2

                     CPU        Elapsed
    SQL              00:00.3    00:11.1
    METHOD 1
      DATA STEP      00:10.0    00:26.2
      SORT           00:01.9    00:08.9
      DATA STEP      00:00.1    00:00.2
      TOTAL          00:12.0    00:35.3
    METHOD 2
      SORT           00:09.7    00:31.8
      DATA STEP      00:00.4    00:00.4
      TOTAL          00:10.1    00:32.2
    FREQ             00:09.5    00:24.3

Unique Combinations of Variables from Existing Data Set

                     CPU        Elapsed
    SQL              00:04.6    00:26.8
    BASE CODE
      SORT           00:03.8    00:27.3
      DATA STEP      00:00.3    00:00.6
      TOTAL          00:04.1    00:27.9
    FREQ             00:03.2    00:23.1

Summarize a Variable

                     CPU        Elapsed
    FROM DB2
      SQL            00:00.2    00:08.3
      DATA STEP      00:07.4    00:25.7
      MEANS          00:07.6    00:25.8
    FROM DATASET
      SQL            00:02.7    00:06.3
      BASE CODE
        SORT         00:01.5    00:05.0
        DATA STEP    00:00.8    00:02.0
        TOTAL        00:02.3    00:07.0
      MEANS          00:00.4    00:02.0

CONCLUSION

Comparisons between SQL and DATA step code may vary based on platform, operating system, DB2 tuning, and other factors. Small differences in the results desired can produce much larger differences in the needed code, with correspondingly larger differences in comparative efficiency. In the environment tested, SQL was similar or superior for reading and simple summarizing from DB2. The PROCs tested were superior for simple summarizing from existing data sets. These conclusions are best seen not as an ending point but as an approach to benchmarking for an actual application. They may best serve as a guide to what to test when limits on testing demand efficiency in benchmarking as well as in coding.

TRADEMARK INFORMATION

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries.

CONTACT INFORMATION

Steven Feder
Federal Reserve Board, Mail Stop 157
Washington, D.C. 20551
202-452-3144
email: steven.h.feder@frb.gov